DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
Pith reviewed 2026-05-10 03:18 UTC · model grok-4.3
The pith
DR-MMSearchAgent derives batch-wide advantage signals from full trajectories and applies differentiated Gaussian rewards to keep multimodal search agents from collapsing into short or repetitive interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DR-MMSearchAgent addresses premature interaction collapse in agentic multimodal models through two mechanisms. First, it derives advantage signals from whole rollout trajectories across an entire batch using structural proximity, encouraging trajectories of varying lengths even when they share the same correct answer. Second, it employs differentiated Gaussian rewards to dynamically calibrate interaction tolerance, improving information reliability and reducing redundancy. Training is supported by a constructed multi-step deep-reasoning dataset of 3,602 high-quality QA pairs, each requiring at least three reasoning steps.
What carries the argument
Structural proximity across batch trajectories for deriving advantage signals over full rollouts, combined with differentiated Gaussian rewards that dynamically calibrate interaction tolerance.
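The paper does not publish its exact proximity measure, so the batch-wide advantage mechanism can only be sketched. In the hypothetical version below, `structural_proximity` is a stand-in (Jaccard overlap of the action types in two rollouts), and `batch_advantages` baselines each rollout's reward against a proximity-weighted batch average, so a longer exploratory rollout is compared most closely against structurally similar rollouts rather than against every shorter one. A minimal Python sketch under those assumptions:

```python
import numpy as np

def structural_proximity(traj_a, traj_b):
    """Toy proximity: Jaccard overlap of the action types used in two
    trajectories. The paper does not specify its measure; this is a stand-in."""
    sa, sb = set(traj_a), set(traj_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def batch_advantages(trajectories, rewards):
    """Baseline each rollout's reward against a proximity-weighted average
    over the whole batch: structurally similar rollouts share a baseline,
    while dissimilar (e.g. longer, exploratory) ones are compared more
    loosely, so different-length trajectories reaching the same answer
    are not all flattened to zero advantage by a uniform group mean."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(trajectories)
    adv = np.empty(n)
    for i in range(n):
        w = np.array([structural_proximity(trajectories[i], trajectories[j])
                      for j in range(n)])
        baseline = (w * rewards).sum() / w.sum()
        adv[i] = rewards[i] - baseline
    return adv
```

With a batch of two short rollouts and one longer one, only the longer rollout's reward lifts it above its proximity-weighted baseline, which is the qualitative behavior the claim describes; the real method presumably operates on token-level trajectories rather than action-type sets.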
If this is right
- Trajectories of different lengths receive positive advantage signals even when they reach the same final answer.
- Redundant context is reduced while interaction feedback remains reliable.
- Multi-turn training on complex multimodal tasks becomes more stable.
- The method yields an 8.4% accuracy improvement over MMSearch-R1 on FVQA-test.
Where Pith is reading between the lines
- Similar batch-proximity advantage estimation could apply to other variable-length reinforcement-learning agent tasks beyond search.
- The multi-step dataset construction approach offers a reusable template for creating training data that requires chained reasoning.
- Differentiated Gaussian reward shaping might reduce context bloat in long-horizon language-model agents more generally.
Load-bearing premise
The assumption that structural proximity across batch trajectories and differentiated Gaussian rewards will reliably distinguish exploratory behavior and reduce redundancy without introducing new instabilities or requiring extensive hyperparameter tuning.
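One concrete (hypothetical) reading of "differentiated Gaussian rewards": score the interaction count under a Gaussian whose center and amplitude depend on outcome, so correct rollouts are pushed toward a tight interaction budget while incorrect ones are granted a wider tolerance for continued exploration. All constants below are illustrative, not the paper's values:

```python
import math

def gaussian_interaction_reward(n_turns, correct,
                                mu_correct=3.0, mu_incorrect=5.0, sigma=1.5):
    """Hedged sketch of a differentiated Gaussian turn-count reward.
    Correct answers are rewarded most near a target interaction budget
    (mu_correct); incorrect rollouts receive a smaller bonus centred on a
    larger budget (mu_incorrect), tolerating extra exploration instead of
    cutting the episode short. Constants are illustrative assumptions."""
    mu = mu_correct if correct else mu_incorrect
    shape = math.exp(-((n_turns - mu) ** 2) / (2 * sigma ** 2))
    return (1.0 if correct else 0.2) * shape
```

The hyperparameter-sensitivity worry above is visible here: the choice of `mu` and `sigma` directly sets where the reward peaks, so a poorly calibrated center could itself induce collapse toward a fixed turn count.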
What would settle it
An ablation that removes the batch structural proximity, retrains the agent, and shows no reduction in average trajectory length or final accuracy on FVQA-test would falsify the central benefit of the advantage mechanism.
Original abstract
Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer premature interaction collapse, caused by two primary factors: 1) the terminal reward, typically attached only to the last token, prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent, a framework that leverages structural proximity to derive advantage signals from the whole rollout trajectories in an entire batch, such that trajectories of different lengths are further encouraged, even when containing the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we have constructed a multi-step deep-reasoning dataset including 3,602 high-quality QA pairs with at least 3 reasoning steps each. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DR-MMSearchAgent to mitigate premature interaction collapse in multimodal search agents. It introduces two mechanisms: (1) using structural proximity across entire batch rollout trajectories to derive advantage signals that encourage generation of trajectories of varying lengths even with the same correct answer, and (2) differentiated Gaussian rewards to dynamically calibrate interaction tolerance, reduce redundancy, and ensure information reliability. The authors also construct a new multi-step deep-reasoning dataset of 3602 high-quality QA pairs requiring at least three reasoning steps. Extensive experiments are claimed to show state-of-the-art performance, with an 8.4% improvement over MMSearch-R1 on FVQA-test.
Significance. If the performance gains can be causally attributed to the proposed mechanisms rather than dataset effects alone, the work could meaningfully advance RL-based training for agentic multimodal models by addressing collapse and context redundancy. The construction of a multi-step reasoning dataset is a concrete contribution that may support future research in this area.
major comments (2)
- [Abstract] The central claim of an 8.4% improvement over MMSearch-R1 on FVQA-test is stated without any reference to experimental details, baselines, ablation studies, statistical significance, or error analysis. This prevents assessment of whether the gains arise from the structural proximity advantage estimation or the differentiated Gaussian rewards.
- [Abstract] No ablation or controlled comparison is described to isolate the contributions of batch-wide structural proximity for advantage signals and differentiated Gaussian rewards from the effects of the newly introduced 3602-pair multi-step dataset. Without such evidence, the attribution of gains to the proposed RL mechanisms (rather than data scale or quality) remains unverified and load-bearing for the SOTA claim.
minor comments (1)
- [Abstract] The phrase 'extensive experiments' is used but no specific metrics, additional test sets, or comparison methods beyond MMSearch-R1 are mentioned.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight the need for greater clarity in the abstract regarding experimental details and the isolation of our proposed mechanisms from dataset effects. We address each point below and commit to revisions that strengthen the presentation without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract] The central claim of an 8.4% improvement over MMSearch-R1 on FVQA-test is stated without any reference to experimental details, baselines, ablation studies, statistical significance, or error analysis. This prevents assessment of whether the gains arise from the structural proximity advantage estimation or the differentiated Gaussian rewards.
Authors: We agree that the abstract is too concise and lacks sufficient context on the supporting experiments. In the revised manuscript, we will expand the abstract to reference the key baselines (MMSearch-R1 and others), the FVQA-test evaluation, and point to the ablation studies and statistical details presented in Sections 4 and 5. These sections include the full comparison results, component ablations, and error analysis that substantiate the reported 8.4% gain. revision: yes
-
Referee: [Abstract] No ablation or controlled comparison is described to isolate the contributions of batch-wide structural proximity for advantage signals and differentiated Gaussian rewards from the effects of the newly introduced 3602-pair multi-step dataset. Without such evidence, the attribution of gains to the proposed RL mechanisms (rather than data scale or quality) remains unverified and load-bearing for the SOTA claim.
Authors: We acknowledge that the current manuscript does not include explicit controlled experiments that train the baseline MMSearch-R1 on our new 3602-pair dataset or fully ablate the individual RL components while holding the dataset fixed. While our experiments compare DR-MMSearchAgent against MMSearch-R1 and other methods on the new dataset and include component-wise ablations, dedicated isolation studies would provide stronger causal evidence. We will add these controlled comparisons and ablations in the revised version to better attribute performance gains to the structural proximity advantage estimation and differentiated Gaussian rewards. revision: yes
Circularity Check
No circularity in derivation chain; claims are empirical and self-contained
full rationale
The paper proposes two mechanisms (batch structural proximity for advantage signals and differentiated Gaussian rewards) to address premature collapse, supported by a newly constructed 3602-pair training dataset. No equations, derivations, or parameter-fitting steps are described that reduce the 8.4% FVQA-test gain to a fitted input, self-definition, or self-citation chain. Results are presented as held-out empirical outcomes rather than predictions forced by construction. The central performance claim rests on external evaluation rather than internal redefinition of inputs, making the derivation self-contained with no load-bearing circular steps.