pith. machine review for the scientific record.

arxiv: 2604.12890 · v2 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Recognition: unknown

Towards Long-horizon Agentic Multimodal Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long-horizon search · multimodal agents · visual representation · agentic search · multi-hop reasoning · file-based storage · vision-language models · context management

The pith

Mapping images to textual UIDs lets multimodal agents sustain 100-turn searches without visual loss or token overload.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LMM-Searcher, a framework that stores visual assets in an external file system and replaces them in the agent's context with short textual identifiers. Agents then retrieve specific images only when needed through a dedicated fetch tool, which keeps context length manageable across many steps. A separate pipeline generates 12,000 synthetic trajectories that emphasize cross-modal multi-hop reasoning, and these are used to fine-tune a base vision-language model. Experiments on four benchmarks show the resulting agent reaches state-of-the-art open-source performance on long-horizon tasks such as MM-BrowseComp and MMSearch-Plus while working across different underlying models.

Core claim

The central claim is that offloading images to a file system, mapping each to a lightweight textual UID, and providing an on-demand fetch-image tool together enable multimodal search agents to operate over 100 iterative turns without context explosion or loss of visual signals. This representation is paired with a data synthesis method that creates 12K trajectories of complex cross-modal reasoning, which are then used to specialize a vision-language model for the search task.
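To make the mechanism concrete, here is a minimal sketch (not the authors' code) of a file-based visual representation: image bytes are written to an external store, the agent's context keeps only a short textual UID placeholder, and a fetch-image tool reloads the full image on demand. The class and method names (ImageStore, offload, fetch_image) and the UID format are illustrative assumptions, not details from the paper.

```python
# Minimal sketch, assuming images can be offloaded to local files and re-read
# on demand. All identifiers here are hypothetical, not the paper's API.
import uuid
from pathlib import Path


class ImageStore:
    """Offloads image bytes to files and hands the agent lightweight UIDs."""

    def __init__(self, root: str = "agent_images"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)
        self.index: dict[str, Path] = {}  # uid -> file path

    def offload(self, image_bytes: bytes, hint: str = "") -> str:
        """Store an image and return the UID placeholder that replaces it in context."""
        uid = f"img_{len(self.index):04d}_{uuid.uuid4().hex[:6]}"
        path = self.root / f"{uid}.bin"
        path.write_bytes(image_bytes)
        self.index[uid] = path
        # Only this short string (plus an optional caption hint) enters the
        # agent's context, instead of thousands of vision tokens.
        return f"[IMAGE uid={uid} hint={hint!r}]"

    def fetch_image(self, uid: str) -> bytes:
        """On-demand fetch tool: reload the full image only when a turn needs it."""
        if uid not in self.index:
            raise KeyError(f"unknown image uid: {uid}")
        return self.index[uid].read_bytes()


if __name__ == "__main__":
    store = ImageStore()
    placeholder = store.offload(b"\x89PNG...", hint="building with trees")
    print(placeholder)                      # what the agent sees in its context
    uid = placeholder.split("uid=")[1].split(" ")[0]
    print(len(store.fetch_image(uid)))      # bytes reloaded only when requested
```

The trade the paper leans on is visible here: the context cost of an image collapses from thousands of vision tokens to a few characters, at the price of an extra tool call whenever a visual detail is actually needed.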

What carries the argument

The file-based visual representation that replaces images with textual UIDs and supplies a fetch-image tool for progressive, on-demand loading.

Load-bearing premise

Mapping images to lightweight textual UIDs and retrieving them on demand fully preserves the visual details needed for multi-hop cross-modal reasoning without retrieval errors or cumulative context loss over 100 turns.

What would settle it

A controlled test in which the agent is shown an image early in a 60-turn trajectory and is later asked a question that requires recalling and using details from that exact image; the premise fails if UID or fetch errors cause the agent to retrieve or interpret the wrong image.
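A hedged sketch of such a probe, assuming a hypothetical agent interface with observe() and call_tool() methods and the ImageStore sketched above; nothing here comes from the paper's evaluation harness.

```python
# Recall probe sketch: plant one image early, pad the horizon with distractor
# turns, then check that a late fetch returns the planted bytes. The agent
# interface (observe, call_tool) is an assumption for illustration only.
def uid_recall_probe(agent, store, probe_image: bytes, distractor_turns: int = 60) -> bool:
    planted = store.offload(probe_image, hint="probe image, turn 1")
    agent.observe(f"Search result (turn 1): {planted}")

    for turn in range(2, distractor_turns + 1):
        # Unrelated filler observations that grow the horizon without
        # re-mentioning the probe image.
        agent.observe(f"Search result (turn {turn}): irrelevant text snippet #{turn}")

    # A late question can only be answered by re-fetching the planted image.
    uid = planted.split("uid=")[1].split(" ")[0]
    fetched = agent.call_tool("fetch_image", {"uid": uid})
    return fetched == probe_image
```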

Figures

Figures reproduced from arXiv: 2604.12890 by Jie Wu, Jinbiao Peng, Jinyang Li, Ji-Rong Wen, Junyi Li, Wayne Xin Zhao, Yifan Du, Zikang Liu.

Figure 1: An illustration of LMM-Searcher. For simplicity, we employ simple strings as uids in this …
Figure 2: The webpage returned to the agent. Its content is reorganized into a structured representation, …
Figure 3: Overview of automated Visual Question-Answer (VQA) data synthesis pipeline. The …
Figure 4: Tool call distribution of the training data. Our synthesized data require diverse types of tool …
Figure 5: Interactive scaling results. We evaluate different models with our context management …
Figure 6: Detailed illustration of key steps within the model's search trajectory.
read the original abstract

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LMM-Searcher, a framework for long-horizon agentic multimodal search that offloads visual assets to an external file system via lightweight textual UIDs and provides an on-demand fetch-image tool to mitigate context explosion while aiming to preserve multimodal information. It describes a data synthesis pipeline that generates 12K high-quality trajectories for fine-tuning Qwen3-VL-Thinking-30A3B into a specialized agent, and reports that the resulting system scales to 100-turn horizons while achieving SOTA performance among open-source models on long-horizon benchmarks including MM-BrowseComp and MMSearch-Plus, with claimed generalizability across base models.

Significance. If the empirical claims are substantiated, the work would offer a practical engineering solution to context management in extended multimodal agent trajectories, potentially enabling more scalable cross-modal multi-hop reasoning. The data synthesis approach for complex queries and the planned code release are constructive contributions that could support further research in agentic multimodal systems.

major comments (3)
  1. Abstract and §4 (Experiments): The central claims of scaling to 100-turn horizons and achieving SOTA results on MM-BrowseComp and MMSearch-Plus are stated without any reported baselines, ablation studies, error bars, or quantitative measurement of visual information loss or preservation, rendering it impossible to verify whether the performance claims hold.
  2. §3.2 (Method, file-based visual representation): The load-bearing assumption that UID mapping plus the fetch-image tool fully preserves fine-grained visual details for error-free 100-turn cross-modal multi-hop reasoning is asserted but unsupported by metrics on fetch success rate, UID collisions, or retrieval errors, or by an ablation that removes the fetch tool; the data-synthesis pipeline description also provides no evidence that it adequately samples cases where visual detail cannot be replaced by text.
  3. §4 (Experiments): The generalizability claim across different base models and the 100-turn scaling result rest on the UID/fetch mechanism without reported analysis of cumulative context loss, agent recall accuracy for UIDs over long horizons, or failure modes in multi-hop visual reasoning, which directly affects the soundness of the primary empirical contribution.
minor comments (2)
  1. Abstract: The phrase 'strong generalizability across different base models' is used without specifying which models were tested or the magnitude of performance differences observed.
  2. §3: The description of the fetch-image tool would benefit from a concrete example of a multi-turn trajectory showing UID usage and image re-loading to clarify the on-demand strategy.
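Since minor comment 2 asks for exactly this, an illustrative and entirely invented trajectory fragment is sketched below; turn contents, tool names, and the UID are hypothetical and only show how a placeholder stands in for an image until a visual detail forces a re-load.

```python
# Invented trajectory fragment (not from the paper or its benchmarks).
trajectory = [
    {"turn": 1, "tool": "web_search",
     "observation": "Top hit: article with photo [IMAGE uid=img_0007 hint='storefront, 2019']"},
    {"turn": 2, "tool": "web_search",
     "observation": "Second hit: text-only review mentioning a renovation in 2021"},
    # The question hinges on a visual detail (the sign color), so the agent
    # re-loads the image it previously saw only as a UID placeholder.
    {"turn": 3, "tool": "fetch_image", "arguments": {"uid": "img_0007"},
     "observation": "<image bytes re-attached to context for this turn>"},
    {"turn": 4, "tool": "answer",
     "arguments": {"text": "The storefront sign in the 2019 photo is green."}},
]
```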

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. Where the comments identify opportunities to strengthen the empirical support for our claims, we agree to incorporate additional analyses, ablations, and quantitative metrics in the revised manuscript to improve verifiability and transparency.

read point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The central claims of scaling to 100-turn horizons and achieving SOTA results on MM-BrowseComp and MMSearch-Plus are stated without any reported baselines, ablation studies, error bars, or quantitative measurement of visual information loss or preservation, rendering it impossible to verify whether the performance claims hold.

    Authors: We acknowledge the referee's point that the abstract provides a high-level summary. Section 4 of the manuscript does include direct comparisons to multiple open-source baseline agents on both MM-BrowseComp and MMSearch-Plus, establishing the SOTA results among open-source models. To fully address the concern and make all claims verifiable, we will revise §4 to include: (i) explicit baseline tables with all compared methods, (ii) ablation studies isolating the contribution of the UID/fetch mechanism, (iii) error bars computed over multiple random seeds, and (iv) quantitative metrics measuring visual information preservation (e.g., retrieval accuracy and information retention rates across trajectory lengths). These additions will be added to the revised manuscript. revision: yes

  2. Referee: §3.2 (Method, file-based visual representation): The load-bearing assumption that UID mapping plus the fetch-image tool fully preserves fine-grained visual details for error-free 100-turn cross-modal multi-hop reasoning is asserted but unsupported by metrics on fetch success rate, UID collisions, or retrieval errors, or by an ablation that removes the fetch tool; the data-synthesis pipeline description also provides no evidence that it adequately samples cases where visual detail cannot be replaced by text.

    Authors: The UID-based file system and fetch-image tool are central to avoiding context explosion while retaining visual access. We agree that supporting metrics and ablations would strengthen the presentation. In the revision, we will add: (i) an ablation study that disables the fetch tool and measures performance degradation, (ii) empirical statistics on fetch success rate, UID collision rate, and retrieval error rate observed during our 100-turn experiments, and (iii) concrete examples from the 12K-trajectory synthesis pipeline that highlight queries where textual descriptions alone are insufficient and visual details are required for correct multi-hop reasoning. revision: yes

  3. Referee: §4 (Experiments): The generalizability claim across different base models and the 100-turn scaling result rest on the UID/fetch mechanism without reported analysis of cumulative context loss, agent recall accuracy for UIDs over long horizons, or failure modes in multi-hop visual reasoning, which directly affects the soundness of the primary empirical contribution.

    Authors: Section 4 reports results demonstrating generalizability across multiple base models and successful scaling to 100-turn horizons. To provide deeper insight into the UID/fetch mechanism, we will expand the experimental analysis in the revision with: (i) measurements of cumulative context length growth with and without the UID approach, (ii) agent recall accuracy for previously seen UIDs across increasing horizon lengths, and (iii) a categorized breakdown of failure modes in multi-hop visual reasoning tasks. These additions will directly substantiate the scaling and generalizability claims. revision: yes
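As a rough illustration of the context-growth measurement promised in response 3, the sketch below compares cumulative context size with inlined images versus UID placeholders, using assumed per-image and per-UID token costs and a word-count proxy for text tokens; the numbers are placeholders, not results from the paper.

```python
# Context-growth comparison sketch under simplifying assumptions.
IMAGE_TOKENS = 1280      # assumed vision-token cost of one inlined image
UID_TOKENS = 8           # assumed cost of a textual UID placeholder


def context_tokens(turns, offload_images: bool) -> list[int]:
    """Cumulative context size after each turn.

    `turns` is a list of (text, num_images) pairs describing each observation.
    """
    sizes, total = [], 0
    for text, num_images in turns:
        total += len(text.split())
        total += num_images * (UID_TOKENS if offload_images else IMAGE_TOKENS)
        sizes.append(total)
    return sizes


if __name__ == "__main__":
    # 100 turns, each returning ~200 words and one image on average.
    turns = [("word " * 200, 1) for _ in range(100)]
    print(context_tokens(turns, offload_images=False)[-1])  # inlined images
    print(context_tokens(turns, offload_images=True)[-1])   # UID placeholders
```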

Circularity Check

0 steps flagged

No circularity: empirical engineering framework evaluated on benchmarks

full rationale

The paper proposes LMM-Searcher as an engineering system using UID-based file offloading, a fetch tool, and a data synthesis pipeline to generate trajectories for fine-tuning Qwen3-VL. All performance claims (scaling to 100 turns, SOTA on MM-BrowseComp and MMSearch-Plus) rest on experimental results after fine-tuning, not on any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked; the central mechanism is a practical design choice whose validity is tested externally via benchmarks rather than reduced to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on standard assumptions in LLM fine-tuning and agent design plus two newly introduced mechanisms whose effectiveness is asserted empirically.

axioms (1)
  • domain assumption Fine-tuning a multimodal LLM on 12K synthesized long-horizon trajectories produces an agent that generalizes to real benchmarks.
    Invoked when the authors distill the model and claim strong generalizability.
invented entities (2)
  • file-based visual representation with UIDs no independent evidence
    purpose: Reduce token cost while allowing future access to images
    Core new mechanism introduced to solve context explosion.
  • fetch-image tool no independent evidence
    purpose: Enable progressive on-demand visual loading
    Tailored tool added to the agent action space.

pith-pipeline@v0.9.0 · 5576 in / 1429 out tokens · 65061 ms · 2026-05-10T15:37:28.372129+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

Reference graph

Works this paper leans on

65 extracted references · 36 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  2. [2]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

  3. [3]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

  4. [4]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  5. [5]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  6. [6]

    Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  7. [7]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

  8. [8]

    Redsearcher: A scalable and cost-efficient framework for long-horizon search agents.arXiv preprint arXiv:2602.14234, 2026

    Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Chenxiao Zhao, Cheng Xiang, Shengchao Hu, et al. Redsearcher: A scalable and cost-efficient framework for long-horizon search agents.arXiv preprint arXiv:2602.14234, 2026

  9. [9]

    A Survey of Multimodal Retrieval-Augmented Generation

    Lang Mei, Siyu Mo, Zhihan Yang, and Chong Chen. A survey of multimodal retrieval-augmented generation. arXiv preprint arXiv:2504.08748, 2025

  10. [10]

    A Survey of Vision-Language Pre-Trained Models

    Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. A survey of vision-language pre-trained models.arXiv preprint arXiv:2202.10936, 2022

  11. [11]

    A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

  12. [12]

    Towards efficient multimodal large language models: A survey on token compression

    Linli Yao, Long Xing, Yang Shi, Sida Li, Yuanxin Liu, Yuhao Dong, Yi-Fan Zhang, Lei Li, Qingxiu Dong, Xiaoyi Dong, et al. Towards efficient multimodal large language models: A survey on token compression. 2026

  13. [13]

    Token pruning in multimodal large language models: Are we solving the right problem? InFindings of the Association for Computational Linguistics: ACL 2025, pages 15537–15549, 2025

    Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Linfeng Zhang. Token pruning in multimodal large language models: Are we solving the right problem? InFindings of the Association for Computational Linguistics: ACL 2025, pages 15537–15549, 2025

  14. [14]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large language models. arXiv preprint arXiv:2507.13334, 2025

  15. [15]

    Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

  16. [16]

    Iterresearch: Rethinking long-horizon agents via markovian state reconstruction.arXiv e-prints, pages arXiv–2511, 2025

    Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, et al. Iterresearch: Rethinking long-horizon agents via markovian state reconstruction. arXiv e-prints, pages arXiv–2511, 2025

  17. [17]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022

  18. [18]

    Multimodal alignment and fusion: A survey.arXiv preprint arXiv:2411.17040, 2024

    Songtao Li and Hao Tang. Multimodal alignment and fusion: A survey.arXiv preprint arXiv:2411.17040, 2024

  19. [19]

    Exploring the design space of visual context representation in video mllms.arXiv preprint arXiv:2410.13694, 2024

    Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen. Exploring the design space of visual context representation in video mllms. arXiv preprint arXiv:2410.13694, 2024

  20. [20]

    Large vision-language model alignment and misalignment: A survey through the lens of explainability

    Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. Large vision-language model alignment and misalignment: A survey through the lens of explainability. arXiv preprint arXiv:2501.01346, 2025

  21. [21]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  22. [22]

    planning-with-files

    Adi Othman. planning-with-files. https://github.com/othmanadi/planning-with-files, 2024. GitHub repository

  23. [23]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

  24. [24]

    MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

    Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, et al. Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents.arXiv preprint arXiv:2508.13186, 2025

  25. [25]

    Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents.arXiv preprint arXiv:2508.21475, 2025

    Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong. Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents.arXiv preprint arXiv:2508.21475, 2025

  26. [26]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  27. [27]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

  28. [28]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  29. [29]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  30. [30]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

  31. [31]

    A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges.arXiv preprint arXiv:2508.05668, 2025

    Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges.arXiv preprint arXiv:2508.05668, 2025

  32. [32]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020

  33. [33]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  34. [34]

    Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 703...

  35. [35]

    A survey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491–6501, 2024

  36. [36]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

  37. [37]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019

  38. [38]

    Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

    Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

  39. [39]

    Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv:2508.06600, 2025

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

  40. [40]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  41. [41]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  42. [42]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

  43. [43]

    Llava-plus: Learning to use tools for creating multimodal agents

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024

  44. [44]

    Object detection in 20 years: A survey.Proceedings of the IEEE, 111(3):257–276, 2023

    Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey.Proceedings of the IEEE, 111(3):257–276, 2023

  45. [45]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  46. [46]

    Scene text detection and recognition: The deep learning era.International Journal of Computer Vision, 129(1):161–184, 2021

    Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era.International Journal of Computer Vision, 129(1):161–184, 2021

  47. [47]

    Virgo: A preliminary exploration on reproducing o1-like mllm

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904, 2025

  48. [48]

    Thinking with images.https://openai.com/index/thinking-with-images/, 2025

    OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

  49. [49]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  50. [50]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  51. [51]

    Fvqa: Fact-based visual question answering.IEEE transactions on pattern analysis and machine intelligence, 40(10):2413–2427, 2017

    Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Fvqa: Fact-based visual question answering.IEEE transactions on pattern analysis and machine intelligence, 40(10):2413–2427, 2017

  52. [52]

    Livevqa: Live visual knowledge seeking.arXiv e-prints, pages arXiv–2504, 2025

    Mingyang Fu, Yuyang Peng, Benlin Liu, Yao Wan, and Dongping Chen. Livevqa: Live visual knowledge seeking.arXiv e-prints, pages arXiv–2504, 2025

  53. [53]

    What makes for good visual instructions? synthesizing complex visual reasoning instructions for visual instruction tuning

    Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, and Ji-Rong Wen. What makes for good visual instructions? synthesizing complex visual reasoning instructions for visual instruction tuning. InProceedings of the 31st International Conference on Computational Linguistics, pages 8197–8214, 2025

  54. [54]

    Less is more: High-value data selection for visual instruction tuning

    Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, and Ji-Rong Wen. Less is more: High-value data selection for visual instruction tuning. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3712–3721, 2025

  55. [55]

    Seed1.8 Model Card: Towards Generalized Real-World Agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency.arXiv preprint arXiv:2603.20633, 2026

  56. [56]

    Bring reason to vision: Understanding perception and reasoning through model merging

    Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. Bring reason to vision: Understanding perception and reasoning through model merging.arXiv preprint arXiv:2505.05464, 2025

  57. [57]

    Vift: Towards visual instruction-free fine-tuning for large vision-language models

    Zikang Liu, Kun Zhou, Xin Zhao, Dawei Gao, Yaliang Li, and Ji-Rong Wen. Vift: Towards visual instruction-free fine-tuning for large vision-language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10341–10366, 2025

  58. [58]

    MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification

    MiroMind Team, S Bai, L Bing, L Lei, R Li, X Li, X Lin, E Min, L Su, B Wang, et al. Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification. arXiv preprint arXiv:2603.15726, 2026

  59. [59]

    Visbrowse-bench: Benchmarking visual-native search for multimodal browsing agents.arXiv preprint arXiv:2603.16289, 2026

    Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, et al. Visbrowse-bench: Benchmarking visual-native search for multimodal browsing agents.arXiv preprint arXiv:2603.16289, 2026

  60. [60]

    Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv preprint arXiv:2409.12959, 2024

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv preprint arXiv:2409.12959, 2024

  61. [61]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

  62. [62]

    Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks. arXiv preprint arXiv:2602.22808, 2026

    Shiqian Su, Sen Xing, Xuan Dong, Muyan Zhong, Bin Wang, Xizhou Zhu, Yuntao Chen, Wenhai Wang, Yue Deng, Pengxiang Zhu, et al. Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks. arXiv preprint arXiv:2602.22808, 2026

  63. [63]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  64. [64]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

    | Title: 中国时… Thumbnails Agent <think> Wait, the first image is the China Times building with trees in front. Let's count the trees in that image. Let's fetch that image and analyze it. </think> <tool_call> { "name": "tool-fetch- image-fetch_image", "arguments": "{"url": " https://upload.wikimedia.org/... } </tool_call> Zoom-in Agent https://upload.wi kim...