pith. machine review for the scientific record.

arxiv: 2604.12890 · v2 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Recognition: unknown

Towards Long-horizon Agentic Multimodal Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long-horizon search · multimodal agents · visual representation · agentic search · multi-hop reasoning · file-based storage · vision-language models · context management

The pith

Mapping images to textual UIDs lets multimodal agents sustain 100-turn searches without visual loss or token overload.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LMM-Searcher, a framework that stores visual assets in an external file system and replaces them in the agent's context with short textual identifiers. Agents then retrieve specific images only when needed through a dedicated fetch tool, which keeps context length manageable across many steps. A separate pipeline generates 12,000 synthetic trajectories that emphasize cross-modal multi-hop reasoning, and these are used to fine-tune a base vision-language model. Experiments on four benchmarks show the resulting agent reaches state-of-the-art open-source performance on long-horizon tasks such as MM-BrowseComp and MMSearch-Plus while working across different underlying models.

Core claim

The central claim is that offloading images to a file system, mapping each to a lightweight textual UID, and providing an on-demand fetch-image tool together enable multimodal search agents to operate over 100 iterative turns without context explosion or loss of visual signals. This representation is paired with a data synthesis method that creates 12K trajectories of complex cross-modal reasoning, which are then used to specialize a vision-language model for the search task.
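To make the mechanism concrete, here is a minimal sketch (not the authors' code) of a file-based visual representation: image bytes are written to an external store, the agent's context keeps only a short textual UID placeholder, and a fetch-image tool reloads the full image on demand. The class and method names (ImageStore, offload, fetch_image) and the UID format are illustrative assumptions, not details from the paper.

```python
# Minimal sketch, assuming images can be offloaded to local files and re-read
# on demand. All identifiers here are hypothetical, not the paper's API.
import uuid
from pathlib import Path


class ImageStore:
    """Offloads image bytes to files and hands the agent lightweight UIDs."""

    def __init__(self, root: str = "agent_images"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)
        self.index: dict[str, Path] = {}  # uid -> file path

    def offload(self, image_bytes: bytes, hint: str = "") -> str:
        """Store an image and return the UID placeholder that replaces it in context."""
        uid = f"img_{len(self.index):04d}_{uuid.uuid4().hex[:6]}"
        path = self.root / f"{uid}.bin"
        path.write_bytes(image_bytes)
        self.index[uid] = path
        # Only this short string (plus an optional caption hint) enters the
        # agent's context, instead of thousands of vision tokens.
        return f"[IMAGE uid={uid} hint={hint!r}]"

    def fetch_image(self, uid: str) -> bytes:
        """On-demand fetch tool: reload the full image only when a turn needs it."""
        if uid not in self.index:
            raise KeyError(f"unknown image uid: {uid}")
        return self.index[uid].read_bytes()


if __name__ == "__main__":
    store = ImageStore()
    placeholder = store.offload(b"\x89PNG...", hint="building with trees")
    print(placeholder)                      # what the agent sees in its context
    uid = placeholder.split("uid=")[1].split(" ")[0]
    print(len(store.fetch_image(uid)))      # bytes reloaded only when requested
```

The trade the paper leans on is visible here: the context cost of an image collapses from thousands of vision tokens to a few characters, at the price of an extra tool call whenever a visual detail is actually needed.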

What carries the argument

The file-based visual representation that replaces images with textual UIDs and supplies a fetch-image tool for progressive, on-demand loading.

Load-bearing premise

Mapping images to lightweight textual UIDs and retrieving them on demand fully preserves the visual details needed for multi-hop cross-modal reasoning without retrieval errors or cumulative context loss over 100 turns.

What would settle it

A controlled test in which the agent is shown an image early in a 60-turn trajectory and is later asked a question that requires recalling and using details from that exact image; the premise fails if UID or fetch errors cause the agent to retrieve or interpret the wrong image.
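A hedged sketch of such a probe, assuming a hypothetical agent interface with observe() and call_tool() methods and the ImageStore sketched above; nothing here comes from the paper's evaluation harness.

```python
# Recall probe sketch: plant one image early, pad the horizon with distractor
# turns, then check that a late fetch returns the planted bytes. The agent
# interface (observe, call_tool) is an assumption for illustration only.
def uid_recall_probe(agent, store, probe_image: bytes, distractor_turns: int = 60) -> bool:
    planted = store.offload(probe_image, hint="probe image, turn 1")
    agent.observe(f"Search result (turn 1): {planted}")

    for turn in range(2, distractor_turns + 1):
        # Unrelated filler observations that grow the horizon without
        # re-mentioning the probe image.
        agent.observe(f"Search result (turn {turn}): irrelevant text snippet #{turn}")

    # A late question can only be answered by re-fetching the planted image.
    uid = planted.split("uid=")[1].split(" ")[0]
    fetched = agent.call_tool("fetch_image", {"uid": uid})
    return fetched == probe_image
```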

Figures

Figures reproduced from arXiv: 2604.12890 by Jie Wu, Jinbiao Peng, Jinyang Li, Ji-Rong Wen, Junyi Li, Wayne Xin Zhao, Yifan Du, Zikang Liu.

Figure 1: An illustration of LMM-Searcher. For simplicity, we employ simple strings as uids in this …
Figure 2: The webpage returned to the agent. Its content is reorganized into a structured representation, …
Figure 3: Overview of automated Visual Question-Answer (VQA) data synthesis pipeline. The …
Figure 4: Tool call distribution of the training data. Our synthesized data require diverse types of tool …
Figure 5: Interactive scaling results. We evaluate different models with our context management …
Figure 6: Detailed illustration of key steps within the model's search trajectory.
read the original abstract

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LMM-Searcher, a framework for long-horizon agentic multimodal search that offloads visual assets to an external file system via lightweight textual UIDs and provides an on-demand fetch-image tool to mitigate context explosion while aiming to preserve multimodal information. It describes a data synthesis pipeline that generates 12K high-quality trajectories for fine-tuning Qwen3-VL-Thinking-30A3B into a specialized agent, and reports that the resulting system scales to 100-turn horizons while achieving SOTA performance among open-source models on long-horizon benchmarks including MM-BrowseComp and MMSearch-Plus, with claimed generalizability across base models.

Significance. If the empirical claims are substantiated, the work would offer a practical engineering solution to context management in extended multimodal agent trajectories, potentially enabling more scalable cross-modal multi-hop reasoning. The data synthesis approach for complex queries and the planned code release are constructive contributions that could support further research in agentic multimodal systems.

major comments (3)
  1. Abstract and §4 (Experiments): The central claims of scaling to 100-turn horizons and achieving SOTA results on MM-BrowseComp and MMSearch-Plus are stated without any reported baselines, ablation studies, error bars, or quantitative measurement of visual information loss or preservation, rendering it impossible to verify whether the performance claims hold.
  2. §3.2 (Method, file-based visual representation): The load-bearing assumption that UID mapping plus the fetch-image tool fully preserves fine-grained visual details for error-free 100-turn cross-modal multi-hop reasoning is asserted but unsupported by metrics on fetch success rate, UID collisions, or retrieval errors, or by an ablation that removes the fetch tool; the data-synthesis pipeline description also provides no evidence that it adequately samples cases where visual detail cannot be replaced by text.
  3. §4 (Experiments): The generalizability claim across different base models and the 100-turn scaling result rest on the UID/fetch mechanism without reported analysis of cumulative context loss, agent recall accuracy for UIDs over long horizons, or failure modes in multi-hop visual reasoning, which directly affects the soundness of the primary empirical contribution.
minor comments (2)
  1. Abstract: The phrase 'strong generalizability across different base models' is used without specifying which models were tested or the magnitude of performance differences observed.
  2. §3: The description of the fetch-image tool would benefit from a concrete example of a multi-turn trajectory showing UID usage and image re-loading to clarify the on-demand strategy.
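Since minor comment 2 asks for exactly this, an illustrative and entirely invented trajectory fragment is sketched below; turn contents, tool names, and the UID are hypothetical and only show how a placeholder stands in for an image until a visual detail forces a re-load.

```python
# Invented trajectory fragment (not from the paper or its benchmarks).
trajectory = [
    {"turn": 1, "tool": "web_search",
     "observation": "Top hit: article with photo [IMAGE uid=img_0007 hint='storefront, 2019']"},
    {"turn": 2, "tool": "web_search",
     "observation": "Second hit: text-only review mentioning a renovation in 2021"},
    # The question hinges on a visual detail (the sign color), so the agent
    # re-loads the image it previously saw only as a UID placeholder.
    {"turn": 3, "tool": "fetch_image", "arguments": {"uid": "img_0007"},
     "observation": "<image bytes re-attached to context for this turn>"},
    {"turn": 4, "tool": "answer",
     "arguments": {"text": "The storefront sign in the 2019 photo is green."}},
]
```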

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. Where the comments identify opportunities to strengthen the empirical support for our claims, we agree to incorporate additional analyses, ablations, and quantitative metrics in the revised manuscript to improve verifiability and transparency.

read point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The central claims of scaling to 100-turn horizons and achieving SOTA results on MM-BrowseComp and MMSearch-Plus are stated without any reported baselines, ablation studies, error bars, or quantitative measurement of visual information loss or preservation, rendering it impossible to verify whether the performance claims hold.

    Authors: We acknowledge the referee's point that the abstract provides a high-level summary. Section 4 of the manuscript does include direct comparisons to multiple open-source baseline agents on both MM-BrowseComp and MMSearch-Plus, establishing the SOTA results among open-source models. To fully address the concern and make all claims verifiable, we will revise §4 to include: (i) explicit baseline tables with all compared methods, (ii) ablation studies isolating the contribution of the UID/fetch mechanism, (iii) error bars computed over multiple random seeds, and (iv) quantitative metrics measuring visual information preservation (e.g., retrieval accuracy and information retention rates across trajectory lengths). These additions will be added to the revised manuscript. revision: yes

  2. Referee: §3.2 (Method, file-based visual representation): The load-bearing assumption that UID mapping plus the fetch-image tool fully preserves fine-grained visual details for error-free 100-turn cross-modal multi-hop reasoning is asserted but unsupported by metrics on fetch success rate, UID collisions, or retrieval errors, or by an ablation that removes the fetch tool; the data-synthesis pipeline description also provides no evidence that it adequately samples cases where visual detail cannot be replaced by text.

    Authors: The UID-based file system and fetch-image tool are central to avoiding context explosion while retaining visual access. We agree that supporting metrics and ablations would strengthen the presentation. In the revision, we will add: (i) an ablation study that disables the fetch tool and measures performance degradation, (ii) empirical statistics on fetch success rate, UID collision rate, and retrieval error rate observed during our 100-turn experiments, and (iii) concrete examples from the 12K-trajectory synthesis pipeline that highlight queries where textual descriptions alone are insufficient and visual details are required for correct multi-hop reasoning. revision: yes

  3. Referee: §4 (Experiments): The generalizability claim across different base models and the 100-turn scaling result rest on the UID/fetch mechanism without reported analysis of cumulative context loss, agent recall accuracy for UIDs over long horizons, or failure modes in multi-hop visual reasoning, which directly affects the soundness of the primary empirical contribution.

    Authors: Section 4 reports results demonstrating generalizability across multiple base models and successful scaling to 100-turn horizons. To provide deeper insight into the UID/fetch mechanism, we will expand the experimental analysis in the revision with: (i) measurements of cumulative context length growth with and without the UID approach, (ii) agent recall accuracy for previously seen UIDs across increasing horizon lengths, and (iii) a categorized breakdown of failure modes in multi-hop visual reasoning tasks. These additions will directly substantiate the scaling and generalizability claims. revision: yes
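As a rough illustration of the context-growth measurement promised in response 3, the sketch below compares cumulative context size with inlined images versus UID placeholders, using assumed per-image and per-UID token costs and a word-count proxy for text tokens; the numbers are placeholders, not results from the paper.

```python
# Context-growth comparison sketch under simplifying assumptions.
IMAGE_TOKENS = 1280      # assumed vision-token cost of one inlined image
UID_TOKENS = 8           # assumed cost of a textual UID placeholder


def context_tokens(turns, offload_images: bool) -> list[int]:
    """Cumulative context size after each turn.

    `turns` is a list of (text, num_images) pairs describing each observation.
    """
    sizes, total = [], 0
    for text, num_images in turns:
        total += len(text.split())
        total += num_images * (UID_TOKENS if offload_images else IMAGE_TOKENS)
        sizes.append(total)
    return sizes


if __name__ == "__main__":
    # 100 turns, each returning ~200 words and one image on average.
    turns = [("word " * 200, 1) for _ in range(100)]
    print(context_tokens(turns, offload_images=False)[-1])  # inlined images
    print(context_tokens(turns, offload_images=True)[-1])   # UID placeholders
```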

Circularity Check

0 steps flagged

No circularity: empirical engineering framework evaluated on benchmarks

full rationale

The paper proposes LMM-Searcher as an engineering system using UID-based file offloading, a fetch tool, and a data synthesis pipeline to generate trajectories for fine-tuning Qwen3-VL. All performance claims (scaling to 100 turns, SOTA on MM-BrowseComp and MMSearch-Plus) rest on experimental results after fine-tuning, not on any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked; the central mechanism is a practical design choice whose validity is tested externally via benchmarks rather than reduced to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on standard assumptions in LLM fine-tuning and agent design plus two newly introduced mechanisms whose effectiveness is asserted empirically.

axioms (1)
  • domain assumption Fine-tuning a multimodal LLM on 12K synthesized long-horizon trajectories produces an agent that generalizes to real benchmarks.
    Invoked when the authors distill the model and claim strong generalizability.
invented entities (2)
  • file-based visual representation with UIDs no independent evidence
    purpose: Reduce token cost while allowing future access to images
    Core new mechanism introduced to solve context explosion.
  • fetch-image tool no independent evidence
    purpose: Enable progressive on-demand visual loading
    Tailored tool added to the agent action space.

pith-pipeline@v0.9.0 · 5576 in / 1429 out tokens · 65061 ms · 2026-05-10T15:37:28.372129+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

Reference graph

Works this paper leans on

65 extracted references · 36 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  2. [2]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

  3. [3]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

  4. [4]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  5. [5]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  6. [6]

    Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  7. [7]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

  8. [8]

    Redsearcher: A scalable and cost-efficient framework for long-horizon search agents.arXiv preprint arXiv:2602.14234, 2026

    Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Chenxiao Zhao, Cheng Xiang, Shengchao Hu, et al. Redsearcher: A scalable and cost-efficient framework for long-horizon search agents.arXiv preprint arXiv:2602.14234, 2026

  9. [9]

    A Survey of Multimodal Retrieval-Augmented Generation

    Lang Mei, Siyu Mo, Zhihan Yang, and Chong Chen. A survey of multimodal retrieval-augmented generation. arXiv preprint arXiv:2504.08748, 2025

  10. [10]

    A Survey of Vision-Language Pre-Trained Models

    Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. A survey of vision-language pre-trained models.arXiv preprint arXiv:2202.10936, 2022

  11. [11]

    A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

  12. [12]

    Towards efficient multimodal large language models: A survey on token compression

    Linli Yao, Long Xing, Yang Shi, Sida Li, Yuanxin Liu, Yuhao Dong, Yi-Fan Zhang, Lei Li, Qingxiu Dong, Xiaoyi Dong, et al. Towards efficient multimodal large language models: A survey on token compression. 2026

  13. [13]

    Token pruning in multimodal large language models: Are we solving the right problem? InFindings of the Association for Computational Linguistics: ACL 2025, pages 15537–15549, 2025

    Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Linfeng Zhang. Token pruning in multimodal large language models: Are we solving the right problem? InFindings of the Association for Computational Linguistics: ACL 2025, pages 15537–15549, 2025

  14. [14]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large language models. arXiv preprint arXiv:2507.13334, 2025

  15. [15]

    Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

  16. [16]

    Iterresearch: Rethinking long-horizon agents via markovian state reconstruction.arXiv e-prints, pages arXiv–2511, 2025

    Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, et al. Iterresearch: Rethinking long-horizon agents via markovian state reconstruction. arXiv e-prints, pages arXiv–2511, 2025

  17. [17]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022

  18. [18]

    Multimodal alignment and fusion: A survey.arXiv preprint arXiv:2411.17040, 2024

    Songtao Li and Hao Tang. Multimodal alignment and fusion: A survey.arXiv preprint arXiv:2411.17040, 2024

  19. [19]

    Exploring the design space of visual context representation in video mllms.arXiv preprint arXiv:2410.13694, 2024

    Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen. Exploring the design space of visual context representation in video mllms. arXiv preprint arXiv:2410.13694, 2024

  20. [20]

    Large vision-language model alignment and misalignment: A survey through the lens of explainability

    Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. Large vision-language model alignment and misalignment: A survey through the lens of explainability. arXiv preprint arXiv:2501.01346, 2025

  21. [21]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  22. [22]

    planning-with-files

    Adi Othman. planning-with-files. https://github.com/othmanadi/planning-with-files, 2024. GitHub repository

  23. [23]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

  24. [24]

    MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

    Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, et al. Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents.arXiv preprint arXiv:2508.13186, 2025

  25. [25]

    Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents.arXiv preprint arXiv:2508.21475, 2025

    Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong. Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents.arXiv preprint arXiv:2508.21475, 2025

  26. [26]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  27. [27]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

  28. [28]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  29. [29]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  30. [30]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

  31. [31]

    A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges.arXiv preprint arXiv:2508.05668, 2025

    Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges.arXiv preprint arXiv:2508.05668, 2025

  32. [32]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020

  33. [33]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  34. [34]

    Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 703...

  35. [35]

    A survey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491–6501, 2024

  36. [36]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

  37. [37]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019

  38. [38]

    Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

    Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

  39. [39]

    Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv:2508.06600, 2025

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

  40. [40]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  41. [41]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  42. [42]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

  43. [43]

    Llava-plus: Learning to use tools for creating multimodal agents

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024

  44. [44]

    Object detection in 20 years: A survey.Proceedings of the IEEE, 111(3):257–276, 2023

    Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey.Proceedings of the IEEE, 111(3):257–276, 2023

  45. [45]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  46. [46]

    Scene text detection and recognition: The deep learning era.International Journal of Computer Vision, 129(1):161–184, 2021

    Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era.International Journal of Computer Vision, 129(1):161–184, 2021

  47. [47]

    Virgo: A preliminary exploration on reproducing o1-like mllm

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904, 2025

  48. [48]

    Thinking with images.https://openai.com/index/thinking-with-images/, 2025

    OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

  49. [49]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  50. [50]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  51. [51]

    Fvqa: Fact-based visual question answering.IEEE transactions on pattern analysis and machine intelligence, 40(10):2413–2427, 2017

    Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Fvqa: Fact-based visual question answering.IEEE transactions on pattern analysis and machine intelligence, 40(10):2413–2427, 2017

  52. [52]

    Livevqa: Live visual knowledge seeking.arXiv e-prints, pages arXiv–2504, 2025

    Mingyang Fu, Yuyang Peng, Benlin Liu, Yao Wan, and Dongping Chen. Livevqa: Live visual knowledge seeking.arXiv e-prints, pages arXiv–2504, 2025

  53. [53]

    What makes for good visual instructions? synthesizing complex visual reasoning instructions for visual instruction tuning

    Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, and Ji-Rong Wen. What makes for good visual instructions? synthesizing complex visual reasoning instructions for visual instruction tuning. InProceedings of the 31st International Conference on Computational Linguistics, pages 8197–8214, 2025

  54. [54]

    Less is more: High-value data selection for visual instruction tuning

    Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, and Ji-Rong Wen. Less is more: High-value data selection for visual instruction tuning. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3712–3721, 2025

  55. [55]

    Seed1.8 Model Card: Towards Generalized Real-World Agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency.arXiv preprint arXiv:2603.20633, 2026

  56. [56]

    Bring reason to vision: Understanding perception and reasoning through model merging

    Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. Bring reason to vision: Understanding perception and reasoning through model merging.arXiv preprint arXiv:2505.05464, 2025

  57. [57]

    Vift: Towards visual instruction-free fine-tuning for large vision-language models

    Zikang Liu, Kun Zhou, Xin Zhao, Dawei Gao, Yaliang Li, and Ji-Rong Wen. Vift: Towards visual instruction-free fine-tuning for large vision-language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10341–10366, 2025

  58. [58]

    MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification

    MiroMind Team, S Bai, L Bing, L Lei, R Li, X Li, X Lin, E Min, L Su, B Wang, et al. Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification. arXiv preprint arXiv:2603.15726, 2026

  59. [59]

    Visbrowse-bench: Benchmarking visual-native search for multimodal browsing agents.arXiv preprint arXiv:2603.16289, 2026

    Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, et al. Visbrowse-bench: Benchmarking visual-native search for multimodal browsing agents.arXiv preprint arXiv:2603.16289, 2026

  60. [60]

    Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv preprint arXiv:2409.12959, 2024

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv preprint arXiv:2409.12959, 2024

  61. [61]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

  62. [62]

    Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks. arXiv preprint arXiv:2602.22808, 2026

    Shiqian Su, Sen Xing, Xuan Dong, Muyan Zhong, Bin Wang, Xizhou Zhu, Yuntao Chen, Wenhai Wang, Yue Deng, Pengxiang Zhu, et al. Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks. arXiv preprint arXiv:2602.22808, 2026

  63. [63]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  64. [64]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

    | Title: 中国时… Thumbnails Agent <think> Wait, the first image is the China Times building with trees in front. Let's count the trees in that image. Let's fetch that image and analyze it. </think> <tool_call> { "name": "tool-fetch- image-fetch_image", "arguments": "{"url": " https://upload.wikimedia.org/... } </tool_call> Zoom-in Agent https://upload.wi kim...