arxiv: 2604.20486 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

Wentao Yan , Shengqin Wang , Huichi Zhou , Yihang Chen , Kun Shao , Yuan Xie , Zhizhong Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal search agentprocess-oriented rewardssim-to-real transfervisual question answeringzero-shot generalizationreinforcement learning

0 comments

The pith

A multimodal search agent trained entirely in a static local sandbox with introspective rewards transfers zero-shot to live Google Search and sets new records on visual reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problems of sparse outcome rewards and chaotic live web environments when training agents for knowledge-heavy visual questions. It trains the entire policy inside a deterministic local sandbox instead of the real web. An introspective process-oriented reward probes the agent's own knowledge limits to give dense feedback on when to start a search. The resulting policy moves directly to the live Google Search API and beats prior methods by clear margins on standard test sets.

Core claim

By decoupling policy learning into a deterministic local static sandbox and using an introspective process-oriented reward that generates dense behavioral metadata, the agent learns to initiate multimodal or text searches only when it detects visual or factual uncertainty; the locally trained policy then transfers zero-shot to the live Google Search API and reaches new state-of-the-art accuracy.

What carries the argument

The introspective process-oriented reward, which probes the agent's parametric knowledge boundaries inside the sandbox to supply dense signals that reward correct decisions about when to search.

If this is right

The agent outperforms MMSearch-R1 by 5.1 percent on FVQA-test, 6.3 percent on InfoSeek, and 11.3 percent on MMSearch.
Policy learning can be completed entirely offline in a controlled sandbox before any live API calls are made.
Search initiation decisions become more precise because rewards target cognitive uncertainty rather than final answer correctness alone.
Zero-shot transfer removes the need to expose training to unpredictable or costly real-web interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sandbox-plus-introspective-reward pattern may reduce training costs for other web-facing agents that currently require live interaction during learning.
Process-level signals about knowledge gaps could be combined with other modalities or tasks where outcome rewards are equally sparse.
Extending the sandbox with more varied simulated web responses might further close the remaining gap to fully online training.

Load-bearing premise

The static sandbox and the agent's self-probed knowledge boundaries are similar enough to live web dynamics that the learned search decisions will work without further training.

What would settle it

Running the trained agent on the live Google Search API and observing no performance gain or outright failure compared with training directly in the real environment.

Figures

Figures reproduced from arXiv: 2604.20486 by Huichi Zhou, Kun Shao, Shengqin Wang, Wentao Yan, Yihang Chen, Yuan Xie, Zhizhong Zhang.

**Figure 1.** Figure 1: Training Framework Comparison. (Top) The previous coupled framework relies on live APIs, resulting in High Cost, High Latency, Instability, and Irreproducibility. (Bottom) Our decoupled framework uses a Tool Sandbox with a local cache and Wiki dump, enabling Low Cost, Low Latency, Stable, and Reproducible training. formulate queries, often utilizing multimodal tools like image-to-image search to ground spe… view at source ↗

**Figure 2.** Figure 2: Illustration of data construction process of our datasets: (a). Website information and image collection; (b). VQA generation module; (c). Data filtering and search partitioning; (d). Visualization of problem type classification; (e). Visualization of search types. isolates the final outcome from the process, allowing for clean, consistent credit assignment. In detail, reward–action mismatch is exemplifie… view at source ↗

**Figure 3.** Figure 3: The ProMMSearchAgent framework. The agent generates rollouts by deciding between direct answers or multi-step tool calls. The MLLM is optimized using GRPO. The reward signal is a composite of three components: outcome (Routcome), behavior (Rbehavior), and format (Rformat) rewards. is similar in function to format scoring in prior works. Rbehavior provides a dense reward based only on the correctness of the… view at source ↗

**Figure 4.** Figure 4: Case Comparison of Complex Multi-Step Reasoning. The baseline model (left) employs a deviant search-reasoning strategy, leading to an incorrect answer. Our model (right) first locates the key entity via image search, then uses this cue to execute a precise text search, reasoning successfully to the correct answer. Under the ReAct Agent setting, enabling Qwen2.5-VL models to use tools leads to clear gains o… view at source ↗

read the original abstract

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core claim is zero-shot transfer of a multimodal search policy from a local deterministic sandbox to live Google API via an introspective process reward, but the abstract supplies almost no evidence that the sandbox or reward actually bridge that gap.

read the letter

The main thing to know is that they train the agent entirely inside a static local sandbox with a reward that probes the model's own parametric knowledge to decide when to search, then report that the resulting policy works zero-shot on the real Google Search API and beats MMSearch-R1 by 5-11 points on three benchmarks. That is the punchline they want you to take away. What is actually new is the specific form of the process-oriented reward: instead of waiting for final-answer correctness, it generates dense feedback by checking whether the agent should have known the answer from its parameters and only rewards the decision to search when uncertainty is detected. This is a direct attempt to fix the sparsity problem in RL for knowledge-intensive visual reasoning without needing live web calls during training. The paper does a reasonable job stating the two bottlenecks it targets—sparse outcome rewards and unpredictable live environments—and framing the sandbox as a way around the second one. The concrete percentage gains are at least something to look at. The soft spots are right where the stress-test note flags them. The abstract gives no account of how the sandbox is built, whether it injects any variability to stand in for real API noise, ranking changes, or latency, and no ablations that separate the contribution of the introspective reward from simply training in the sandbox. Without those pieces it is impossible to tell whether the reported transfer is robust or whether the gains are mostly benchmark-specific. The assumption that a deterministic static environment plus knowledge-probing rewards will capture the cognitive and environmental demands of live search is doing a lot of work here, and nothing in the provided text shows it holds. This is for people working on tool-augmented multimodal agents and on reward design for RL in sparse settings. A reader who needs new ideas for process supervision might extract the reward sketch, but anyone who wants to replicate or extend the zero-shot result will have to wait for the full methods and controls. It deserves a serious referee. The problem is real and the direction is worth checking, even if the current write-up leaves the central transfer claim under-supported.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces ProMMSearchAgent, a multimodal search agent for knowledge-intensive visual reasoning. It proposes a Sim-to-Real paradigm that decouples policy learning into a deterministic local static sandbox and uses an introspective process-oriented reward. This reward probes the agent's parametric knowledge boundaries to generate dense supervision, rewarding correct cognitive decisions such as initiating a multimodal or text search only when uncertain. The central claim is that the locally trained policy transfers zero-shot to the live Google Search API, achieving new SOTA results with gains of +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch over MMSearch-R1.

Significance. If the zero-shot transfer and reported gains hold under detailed scrutiny, the work would be significant for multimodal agent training. The introspective process-oriented reward offers a promising mechanism to address extreme outcome sparsity by leveraging the agent's own knowledge boundaries for dense behavioral feedback, and the Sim-to-Real setup could enable more efficient training without constant live API access. This could influence future designs for generalizable agents in unpredictable environments, provided the sim-to-real gap is rigorously characterized.

major comments (3)

[Abstract] Abstract: The specific quantitative gains (+5.1% on FVQA-test, +6.3% on InfoSeek, +11.3% on MMSearch) and zero-shot transfer claim are presented without any reference to experimental setup, baselines, number of runs, variance, or statistical tests. This absence is load-bearing for the central SOTA and generalization claims, as the data cannot be assessed for robustness.
[Method] Method section: The deterministic local static sandbox is introduced without any description of its construction, such as how search results are mocked, whether result variability or latency is simulated, or how uncertainty is injected. This detail is essential to evaluate whether the introspective reward truly enables transfer to live Google Search API stochasticity.
[Experiments] Experiments section: No ablations are mentioned that isolate the introspective process-oriented reward's contribution to zero-shot transfer versus standard outcome rewards or other factors. Without these, it remains possible that the reported improvements stem from benchmark alignment rather than the proposed paradigm.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and rigor of our claims regarding the Sim-to-Real paradigm and the introspective reward. We address each major comment point-by-point below, with proposed revisions to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The specific quantitative gains (+5.1% on FVQA-test, +6.3% on InfoSeek, +11.3% on MMSearch) and zero-shot transfer claim are presented without any reference to experimental setup, baselines, number of runs, variance, or statistical tests. This absence is load-bearing for the central SOTA and generalization claims, as the data cannot be assessed for robustness.

Authors: We agree that the abstract should better contextualize the quantitative claims to support immediate assessment of robustness. While the full experimental protocol (including baselines such as MMSearch-R1, evaluation on FVQA-test/InfoSeek/MMSearch, averaging over 3 random seeds with reported standard deviations, and statistical comparisons) is detailed in Section 4, we will revise the abstract to add a brief reference to the evaluation setup and note that variance and significance details are provided in the main text. revision: yes
Referee: [Method] Method section: The deterministic local static sandbox is introduced without any description of its construction, such as how search results are mocked, whether result variability or latency is simulated, or how uncertainty is injected. This detail is essential to evaluate whether the introspective reward truly enables transfer to live Google Search API stochasticity.

Authors: The sandbox is introduced in Section 3 as a deterministic local static environment using cached real-world search results to enable efficient policy training without live API calls. However, we acknowledge the current description is high-level and lacks explicit details on result mocking mechanics, the deliberate choice not to simulate variability or latency (to isolate policy learning), and uncertainty injection via the agent's internal parametric knowledge probes. We will expand Section 3.1 with these specifics to better characterize the Sim-to-Real gap and support the zero-shot transfer claim. revision: yes
Referee: [Experiments] Experiments section: No ablations are mentioned that isolate the introspective process-oriented reward's contribution to zero-shot transfer versus standard outcome rewards or other factors. Without these, it remains possible that the reported improvements stem from benchmark alignment rather than the proposed paradigm.

Authors: We recognize the value of isolating the process-oriented reward's role in enabling zero-shot transfer. The main experiments include comparisons to outcome-reward baselines (Table 2), but we did not provide a dedicated ablation focused on the transfer setting. We will add a new ablation subsection (and associated table) in the revised Experiments section that directly compares the full model against an outcome-reward-only variant under zero-shot live API evaluation, to demonstrate the contribution of the introspective component. revision: yes

Circularity Check

0 steps flagged

No circularity detected; no equations, derivations, or self-referential reductions in the described training paradigm

full rationale

The abstract and provided text describe a Sim-to-Real paradigm decoupling policy learning into a deterministic local sandbox with an introspective process-oriented reward that probes parametric knowledge boundaries to generate dense behavioral metadata. No mathematical equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or uniqueness theorems appear. The zero-shot transfer to live Google Search API is presented as an empirical outcome from extensive experiments rather than a constructed prediction equivalent to inputs. The central claim rests on the assumption that the sandbox and reward capture necessary dynamics, but this is not reduced by definition or self-reference to the reported SOTA gains. No load-bearing steps reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Only the abstract is available, so the ledger reflects high-level claims rather than detailed derivations. The central claim rests on the transferability of a sandbox-trained policy and the effectiveness of the introspective reward.

axioms (2)

domain assumption The deterministic local static sandbox sufficiently approximates live web search dynamics for zero-shot policy transfer.
Invoked to justify training in simulation and testing on the live Google Search API.
domain assumption Probing the agent's own parametric knowledge boundaries produces dense, correct behavioral metadata that guides search decisions.
Central premise of the process-oriented reward mechanism.

invented entities (1)

introspective process-oriented reward no independent evidence
purpose: Generate dense rewards by rewarding correct decisions on whether to search based on internal uncertainty.
New reward formulation introduced to address sparsity of outcome-based supervision.

pith-pipeline@v0.9.0 · 5489 in / 1455 out tokens · 63786 ms · 2026-05-10T01:07:57.853863+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 24 canonical work pages · 10 internal anchors

[1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Learning to reason with search for llms via reinforcement learning,

Chen, M., Sun, L., Li, T., Sun, H., Zhou, Y., Zhu, C., Wang, H., Pan, J.Z., Zhang, W., Chen, H., et al.: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470 (2025)

work page arXiv 2025
[3]

Chen, Y., Hu, H., Luan, Y., Sun, H., Changpinyo, S., Ritter, A., Chang, M.W.: Can pre-trained vision and language models answer visual information-seeking ques- tions? arXiv preprint arXiv:2302.11713 (2023)

work page arXiv 2023
[4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

2024
[5]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Cheng, X., Zhang, W., Zhang, S., Yang, J., Guan, X., Wu, X., Li, X., Zhang, G., Liu, J., Mai, Y., et al.: Simplevqa: Multimodal factuality evaluation for multimodal large language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4637–4646 (2025)

2025
[6]

Open- vlthinker: Complex vision-language reasoning via iterative sft-rl cycles.arXiv preprint arXiv:2503.17352, 2025

Deng, Y., Bansal, H., Yin, F., Peng, N., Wang, W., Chang, K.W.: Openvl- thinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352 (2025)

work page arXiv 2025
[7]

S., and Krishna, R

Fu, M., Peng, Y., Liu, B., Wan, Y., Chen, D.: Livevqa: Live visual knowledge seeking. arXiv preprint arXiv:2504.05288 (2025)

work page arXiv 2025
[8]

Geng, X., Xia, P., Zhang, Z., Wang, X., Wang, Q., Ding, R., Wang, C., Wu, J., Zhao, Y., Li, K., Jiang, Y., Xie, P., Huang, F., Zhou, J.: Webwatcher: Breaking new frontier of vision-language deep research agent (2025),https://arxiv.org/ abs/2508.05748

work page arXiv 2025
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

In: Proceedings of the 37th International Conference on Machine Learning

Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.W.: Realm: retrieval-augmented language model pre-training. In: Proceedings of the 37th International Conference on Machine Learning. ICML’20, JMLR.org (2020)

2020
[11]

Advances in Neural Information Processing Systems36, 867–878 (2023)

Hu, Z., Iscen, A., Sun, C., Chang, K.W., Sun, Y., Ross, D., Schmid, C., Fathi, A.: Avis: Autonomous visual information seeking with large language model agent. Advances in Neural Information Processing Systems36, 867–878 (2023)

2023
[12]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Yan et al

Jiang, B., Xie, Y., Wang, X., Yuan, Y., Hao, Z., Bai, X., Su, W.J., Taylor, C.J., Mallick, T.: Towards rationality in language and multimodal agents: A survey (2025),https://arxiv.org/abs/2406.00252 16 W. Yan et al

work page arXiv 2025
[14]

Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv preprint arXiv:2409.12959, 2024

Jiang, D., Zhang, R., Guo, Z., Wu, Y., Lei, J., Qiu, P., Lu, P., Chen, Z., Fu, C., Song, G., et al.: Mmsearch: Benchmarking the potential of large models as multi- modal search engines. arXiv preprint arXiv:2409.12959 (2024)

work page arXiv 2024
[15]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., Han, J.: Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025)

work page internal anchor Pith review arXiv 2025
[16]

In: Webber, B., Cohn, T., He, Y., Liu, Y

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6769–6781. Association for Computational Linguistics, Online (...

2020
[17]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

work page internal anchor Pith review arXiv 2024
[18]

Aria: An open multimodal native mixture-of-experts model

Li, D., Liu, Y., Wu, H., Wang, Y., Shen, Z., Qu, B., Niu, X., Zhou, F., Huang, C., Li, Y., et al.: Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993 (2024)

work page arXiv 2024
[19]

Li,X.,Jin,J.,Dong,G.,Qian,H.,Wu,Y.,Wen,J.R.,Zhu,Y.,Dou,Z.:Webthinker: Empowering large reasoning models with deep research capability (2025),https: //arxiv.org/abs/2504.21776

work page arXiv 2025
[20]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

2024
[21]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023
[22]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection (2024),https://arxiv.org/abs/2303.05499

work page internal anchor Pith review arXiv 2024
[23]

M., Shiee, N., Grasch, P., Jia, C., Yang, Y ., and Gan, Z

Narayan, K., Xu, Y., Cao, T., Nerella, K., Patel, V.M., Shiee, N., Grasch, P., Jia, C., Yang, Y., Gan, Z.: Deepmmsearch-r1: Empowering multimodal llms in multimodal web search. arXiv preprint arXiv:2510.12801 (2025)

work page arXiv 2025
[24]

GPT-4o System Card

OpenAI, :, Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Akila Welihinda, e.a.: Gpt-4o system card (2024),https://arxiv. org/abs/2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

In: The Twelfth International Conference on Learning Representations (2024)

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Ye, Q., Wei, F.: Ground- ing multimodal large language models to the world. In: The Twelfth International Conference on Learning Representations (2024)

2024
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv. org/abs/2402.033002(3), 5 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

In: Proceedings of the Twentieth European Conference on Computer Systems

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. In: Proceedings of the Twentieth European Conference on Computer Systems. pp. 1279–1297 (2025)

2025
[28]

Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., Wei, F.: Text embeddings by weakly-supervised contrastive pre-training (2024), https://arxiv.org/abs/2212.03533

work page internal anchor Pith review arXiv 2024
[29]

Wu, J., Yin, W., Jiang, Y., Wang, Z., Xi, Z., Fang, R., Zhang, L., He, Y., Zhou, D., Xie, P., Huang, F.: Webwalker: Benchmarking llms in web traversal (2025), https://arxiv.org/abs/2501.07572 ProMMSearchAgent 17

work page arXiv 2025
[30]

Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., Liu, Z.: Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670 (2025)

work page arXiv 2025
[31]

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: React: Synergizing reasoning and acting in language models (2023),https://arxiv.org/ abs/2210.03629

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Zhang, Z., Zhang, Y., Ding, X., Yue, X.: Vision search assistant: Empower vision- language models as multimodal search engines (2024),https://arxiv.org/abs/ 2410.21220

work page arXiv 2024
[33]

Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

Zheng, Y., Fu, D., Hu, X., Cai, X., Ye, L., Lu, P., Liu, P.: Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160 (2025) A System Prompt Here is the system prompt utilized during both training and inference. It incor- porates detailed tool specifications and outlines the ReAct agent’s...

work page arXiv 2025