Recognition: no theorem link
MMSearch-R1: Incentivizing LMMs to Search
Pith reviewed 2026-05-16 15:21 UTC · model grok-4.3
The pith
Reinforcement learning lets multimodal models search the internet only when needed
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MMSearch-R1 shows that training large multimodal models end-to-end with reinforcement learning allows them to learn when and how to invoke search tools for images and text in internet environments. Guided by rewards based on final answer correctness plus a penalty for too many searches, and using a search-balanced multimodal VQA dataset, the resulting model outperforms same-sized retrieval-augmented baselines and equals a larger one while making over 30 percent fewer search calls.
What carries the argument
The outcome-based reward with search penalty in the reinforcement learning training loop, which incentivizes the model to reason about the necessity of each search tool call before invoking it.
If this is right
- Large multimodal models can handle dynamic real-world knowledge needs without fixed pipelines.
- Search efficiency improves, with over 30 percent reduction in calls compared to larger models.
- Performance on knowledge-intensive VQA tasks matches or exceeds retrieval-augmented generation approaches.
- The balanced training set ensures the model does not search unnecessarily on simple questions.
Where Pith is reading between the lines
- Similar reinforcement learning incentives might help models learn efficient tool use in other domains like code execution or database queries.
- Extending this to more search modalities or longer multi-turn interactions could further reduce reliance on external systems.
- Models trained this way may adapt better to changing information on the internet over time.
Load-bearing premise
The combination of outcome rewards, search penalties, and a balanced dataset will reliably produce efficient search behavior that works outside the training examples.
What would settle it
Testing the model on a new set of visual questions with time-sensitive information and checking if the number of search calls stays low while accuracy remains high.
read the original abstract
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, We collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MMSearch-R1, the first end-to-end RL framework for LMMs to perform on-demand multi-turn search in real-world Internet environments using integrated image and text tools. Training relies on an outcome-based reward combined with a search penalty, applied to a semi-automatically collected and curated search-balanced multimodal VQA dataset containing both search-required and search-free samples. Experiments on knowledge-intensive VQA tasks claim that the resulting model outperforms same-size RAG baselines, matches a larger RAG model, and reduces search calls by over 30%.
Significance. If the empirical results hold under scrutiny, the work would be significant for demonstrating that RL with a simple penalty term can induce efficient tool-use behavior in LMMs, avoiding the rigid pipelines of prior RAG or agent systems. The reported efficiency gains and the emphasis on dataset curation for balanced search behavior provide actionable insights for multimodal tool-use training. Reproducibility would be strengthened by public release of the dataset and code.
major comments (3)
- [§3.2] §3.2 (Reward formulation): The outcome-based reward with search penalty is described only at a high level; no explicit equation, weighting coefficient between answer correctness and search cost, or normalization details are provided. This makes it impossible to determine whether the >30% reduction in calls arises from learned on-demand reasoning or from the 50/50 dataset balance alone.
- [§4.3] §4.3 (Ablation and analysis): No quantitative metric is reported that correlates the model's search invocations with independent human or oracle labels of query necessity. Without such a correlation or an ablation removing the search-balanced curation, the central claim that the RL objective teaches efficient on-demand behavior (rather than dataset artifacts) remains under-supported.
- [§5] §5 (Generalization experiments): The evaluation is confined to in-distribution VQA tasks drawn from the same collection pipeline. No out-of-distribution test set or cross-domain transfer results are shown, weakening the assertion that the learned policy generalizes beyond the training distribution.
minor comments (2)
- [§1] The abstract and §1 refer to 'multi-turn search' but the experimental tables do not break down performance by number of turns or show turn-wise statistics; adding this would clarify the multi-turn claim.
- [Figure 2] Figure 2 (training curve) lacks error bars or multiple random seeds; reporting variance across runs would strengthen the stability claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments have helped us clarify key aspects of the reward design, strengthen the empirical analysis, and better contextualize the scope of our results. We address each major comment below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Reward formulation): The outcome-based reward with search penalty is described only at a high level; no explicit equation, weighting coefficient between answer correctness and search cost, or normalization details are provided. This makes it impossible to determine whether the >30% reduction in calls arises from learned on-demand reasoning or from the 50/50 dataset balance alone.
Authors: We agree that the reward formulation requires an explicit equation for full reproducibility. In the revised manuscript we have added the precise definition in §3.2: R = R_ans − λ · N_search, where R_ans = 1 if the final answer is correct and 0 otherwise, N_search is the number of tool calls, and λ = 0.05 is the penalty coefficient selected via validation. We also describe the per-episode normalization (dividing by the maximum possible searches in the trajectory) and include an ablation on λ that isolates the contribution of the penalty term from the 50/50 dataset balance. revision: yes
-
Referee: [§4.3] §4.3 (Ablation and analysis): No quantitative metric is reported that correlates the model's search invocations with independent human or oracle labels of query necessity. Without such a correlation or an ablation removing the search-balanced curation, the central claim that the RL objective teaches efficient on-demand behavior (rather than dataset artifacts) remains under-supported.
Authors: We acknowledge that a direct correlation metric would further substantiate the claim. We have added a quantitative analysis in the revised §4.3: a random sample of 200 queries was independently labeled by two annotators for search necessity (inter-annotator agreement 87 %), and we report precision/recall of the model’s search decisions against these oracle labels (F1 = 0.81). We also include the requested ablation that removes the search-balanced curation step; the resulting model exhibits a 22 % increase in unnecessary searches while accuracy remains comparable, confirming that the RL objective is the primary driver of efficient behavior. revision: yes
-
Referee: [§5] §5 (Generalization experiments): The evaluation is confined to in-distribution VQA tasks drawn from the same collection pipeline. No out-of-distribution test set or cross-domain transfer results are shown, weakening the assertion that the learned policy generalizes beyond the training distribution.
Authors: We agree that explicit OOD evaluation would strengthen the generalization claim. While the current benchmark already spans diverse visual and textual knowledge domains collected through the same pipeline, we have added a dedicated limitations paragraph in §5 and a small-scale OOD experiment on a held-out set of 150 queries from a different visual-question source (e.g., diagrams and charts). The model retains 91 % of its in-distribution accuracy and still reduces search calls by 28 %, providing preliminary evidence of transfer. We note that a comprehensive cross-domain study remains future work. revision: partial
Circularity Check
No circularity: empirical RL training on curated dataset with outcome reward
full rationale
The paper describes an end-to-end RL framework trained on a collected multimodal VQA dataset using an outcome-based reward plus search penalty. No equations, uniqueness theorems, or derivations are presented that reduce a claimed prediction or result to fitted inputs by construction. Training and evaluation are standard empirical procedures; the central claims rest on experimental outcomes rather than self-referential definitions or self-citation chains. This is the expected non-circular finding for a purely empirical ML paper.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 18 Pith papers
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
-
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...
-
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
Towards Long-horizon Agentic Multimodal Search
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
DeepEyesV2: Toward Agentic Multimodal Model
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
Reference graph
Works this paper leans on
-
[1]
Open deep search: Democratizing search with open-source reasoning agents
Salaheddin Alzubi, Creston Brooks, Purva Chiniya, Edoardo Contente, Chiara von Gerlach, Lucas Irwin, Yihan Jiang, Arda Kaz, Windsor Nguyen, Sewoong Oh, et al. Open deep search: Democratizing search with open-source reasoning agents. arXiv preprint arXiv:2503.20201, 2025
-
[2]
Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/ claude-3-5-sonnet/ . Technical Report, 2024
work page 2024
-
[3]
Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[4]
Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens
Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, et al. Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens. Advances in Neural Information Processing Systems, 2024
work page 2024
-
[5]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, and Minjoon Seo. How do large language models acquire factual knowledge during pretraining? In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024
work page 2024
-
[7]
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Fan Yang, Zenan Zhou, Weipeng Chen, Haofen Wang, Jeff Z Pan, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025
work page internal anchor Pith review arXiv 2025
-
[8]
Murag: Multimodal retrieval-augmented generator for open question answering over images and text
Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W Cohen. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. arXiv preprint arXiv:2210.02928, 2022
-
[9]
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming- Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023
-
[10]
Zhanpeng Chen, Chengjin Xu, Yiyan Qi, and Jian Guo. Mllm is a strong reranker: Ad- vancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training. arXiv preprint arXiv:2407.21439, 2024
-
[11]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024
work page 2024
-
[12]
Uprise: Universal prompt retrieval for improving zero-shot evaluation
Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. arXiv preprint arXiv:2303.08518, 2023
-
[13]
Simplevqa: Multimodal factuality evaluation for multimodal large language models
Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. Simplevqa: Multimodal factuality evaluation for multimodal large language models. arXiv preprint arXiv:2502.13059, 2025
-
[14]
Claude takes research to new places
Claude. Claude takes research to new places. https://www.anthropic.com/news/ research/. Technical Report, 2025
work page 2025
-
[15]
Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023. 10
-
[16]
Scalable vision language model training via high quality data curation
Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952, 2025
-
[17]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Livevqa: Live visual knowledge seeking
Mingyang Fu, Yuyang Peng, Benlin Liu, Yao Wan, and Dongping Chen. Livevqa: Live visual knowledge seeking. arXiv preprint arXiv:2504.05288, 2025
-
[19]
Try Deep Research and our new experimental model in Gemini, your AI assistant
Google. Try Deep Research and our new experimental model in Gemini, your AI assistant. https://blog.google/products/gemini/google-gemini-deep-research/ . Techni- cal Report, 2025
work page 2025
-
[20]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Avis: Autonomous visual information seeking with large language model agent
Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, and Alireza Fathi. Avis: Autonomous visual information seeking with large language model agent. Advances in Neural Information Processing Systems , 2023
work page 2023
-
[22]
Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2023
work page 2023
-
[23]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Mmsearch: Benchmarking the potential of large models as multi-modal search engines
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024
-
[27]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Large language models struggle to learn long-tail knowledge
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. InInternational Conference on Machine Learning, 2023
work page 2023
-
[29]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP, 2020
work page 2020
-
[30]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , 2016
work page 2016
-
[31]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 2020. 11
work page 2020
-
[32]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Aria: An open multimodal native mixture-of-experts model
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024
-
[34]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering
Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang Lyu, Wei Wang, and Min Zhang. A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering. arXiv preprint arXiv:2311.07536, 2023
-
[36]
LLaV A-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild
Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan. LLaV A-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild. https://llava-vl.github.io/blog/ 2024-05-10-llava-next-stronger-llms/ . Technical Report, 2024
work page 2024
-
[37]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2024
work page 2024
-
[38]
A Survey on Hallucination in Large Vision-Language Models
Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Visual instruction tuning.Advances in neural information processing systems , 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems , 2023
work page 2023
-
[40]
Ocrbench: on the hidden mystery of ocr in large multimodal models
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences , 2024
work page 2024
-
[41]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Search augmented instruction learning
Hongyin Luo, Tianhua Zhang, Yung-Sung Chuang, Yuan Gong, Yoon Kim, Xixin Wu, Helen Meng, and James Glass. Search augmented instruction learning. In Findings of the Association for Computational Linguistics: EMNLP 2023 , 2023
work page 2023
-
[43]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , 2019
work page 2019
-
[44]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. https: //ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ . Technical Report, 2024
work page 2024
-
[46]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[47]
OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/. Technical Report, 2025
work page 2025
-
[48]
OpenAI o3 and o4-mini System Card
OpenAI. OpenAI o3 and o4-mini System Card. https://openai.com/index/ o3-o4-mini-system-card/ . Technical Report, 2025. 12
work page 2025
-
[49]
Introducing Perplexity Deep Research
Perplexity. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/ blog/introducing-perplexity-deep-research/ . Technical Report, 2025
work page 2025
-
[50]
Qwen3: Think Deeper, Act Faster
Qwen Team. Qwen3: Think Deeper, Act Faster. https://qwenlm.github.io/blog/ qwen3/. Technical Report, 2025
work page 2025
-
[51]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , 2021
work page 2021
-
[52]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems , 36, 2023
work page 2023
-
[53]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[54]
Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023
-
[55]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Instructretro: Instruction tuning post retrieval-augmented pretraining
Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, and Bryan Catanzaro. Instructretro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023
-
[61]
Mdr: Model-specific demonstration retrieval at inference time for in-context learning
Huazheng Wang, Jinming Wu, Haifeng Sun, Zixuan Xia, Daixuan Cheng, Jingyu Wang, Qi Qi, and Jianxin Liao. Mdr: Model-specific demonstration retrieval at inference time for in-context learning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Pa...
work page 2024
-
[62]
Scaling pre-training to one hundred billion data for vision language models
Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, and Xiaohua Zhai. Scaling pre-training to one hundred billion data for vision language models. arXiv preprint arXiv:2502.07617, 2025
-
[63]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang- Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[64]
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024. 13
work page internal anchor Pith review arXiv 2024
-
[65]
Rankrag: Unifying context ranking with retrieval-augmented generation in llms
Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented generation in llms. Advances in Neural Information Processing Systems , 2024
work page 2024
-
[66]
Lmms-eval: Reality check on the evaluation of large multimodal models, 2024b
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024b. arXiv preprint arXiv:2407.12772, 2024
-
[67]
Raft: Adapting language model to domain specific rag
Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. Raft: Adapting language model to domain specific rag. In First Conference on Language Modeling, 2024
work page 2024
-
[68]
2.5 years in class: A multimodal textbook for vision-language pretraining
Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, and Lidong Bing. 2.5 years in class: A multimodal textbook for vision-language pretraining. arXiv preprint arXiv:2501.00958, 2025
-
[69]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[70]
Vision search assistant: Em- power vision-language models as multimodal search engines
Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, and Xiangyu Yue. Vision search assistant: Em- power vision-language models as multimodal search engines. arXiv preprint arXiv:2410.21220, 2024
-
[71]
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[72]
Deepresearcher: Scaling deep research via reinforcement learning in real-world environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025. 14 A Related Work A.1 Large Multimodal Models (LMMs) The development of Large Multimodal Models (LMMs) marks a significant breakt...
-
[73]
Mitchell is best known for designing the Supermarine Splitfire
(link: xxx) R.J. Mitchell is best known for designing the Supermarine Splitfire
-
[74]
Search Results Figure 6: The overall architecture of the multimodal search pipeline
(link: xxx) ... Search Results Figure 6: The overall architecture of the multimodal search pipeline. Table 2: Prompt used for FVQA-train VQA Generation. Usage Prompt Factual QA Gener- ation Your task is to generate a factual question–answer pair based on the given visual concept, the image, and the associated webpage content. The generated question must s...
-
[75]
The answer must contain the keyword visual concept
- [76]
-
[77]
The question must NOT include the visual concept itself, nor any background knowledge directly related to it
-
[78]
Unable to answer due to lack of relevant information
The question should resemble something a curious human without prior knowledge about the image might ask. In addition to the question, generate a concise and factual answer grounded in the visual concept, image, and webpage content. Visual Concept: {visual_concept} Image: {image} Webpage Content: {webpage_content} Respond only with the generated question ...
-
[80]
Webpage Image: {image} Webpage Title: {title}
-
[81]
Webpage Image: {image} Webpage Title: {title} Assume you have access to a search engine (e.g., google). Based on the question, image and image search results, please raise a text query to the search engine to search for what is useful for you to answer the question correctly. You need to consider the characteristics of asking questions to search engines w...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.