{"total":10,"items":[{"citing_arxiv_id":"2605.13034","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence","primary_cat":"cs.CV","submitted_at":"2026-05-13T05:39:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"structure via jointly evolving knowledge and outline graphs; and WebShaper [22] studies agentic synthesis of training data through formalized information seeking. Recent surveys [31, 9] provide systematic views of this rapidly growing landscape. Despite this progress, most deep research agents remain text-centered at the report level. Recent multimodal agents such as WebWatcher [5] and Vision-DeepResearch [8] extend search to multimodal evidence, while Multimodal DeepResearcher [ 27] interleaves reports with generated charts. In contrast, ViDR treats retrieved source figures as traceable evidence objects and routes them into section-level long-form report generation. 
2.2 Multimodal Document and Report Generation A separate line of work studies how long documents can combine prose with visual artifacts."},{"citing_arxiv_id":"2605.10832","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents","primary_cat":"cs.CL","submitted_at":"2026-05-11T16:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09934","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:32:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent multimodal agents move beyond single-turn perception toward long-horizon interaction, where models iteratively search, inspect, calculate, and integrate information across tools and modalities. Search-R1 [ 4] and MMSearch-R1 [5] integrate external search into model reasoning, first in text-dominant settings and then in multimodal search over images and text. 
WebWatcher [ 6] and Vision-DeepResearch [7] study deep-research-style interaction in visually grounded environments. DeepEyesV2 [ 8], Agent0-VL [ 9], and ReAgent-V [ 10] further improve tool-mediated reasoning, training stability, and video-oriented interaction. Other approaches address unnecessary tool use, context growth, and scalable trajectory collection [ 11, 12, 13, 14], while benchmarks stress multimodal search, sustained information integration, and workflow-level"},{"citing_arxiv_id":"2605.08063","ref_index":6,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flow-OPD: On-Policy Distillation for Flow Matching Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:50:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chen, Shanghang Zhang, and Feng Zhao. Dualvla: Building a generalizable embodied agent via partial decoupling of reasoning and action.arXiv preprint arXiv:2511.22134, 2025. [5] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026. [6] Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026. 
[7] Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie"},{"citing_arxiv_id":"2605.08043","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:32:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07177","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents","primary_cat":"cs.LG","submitted_at":"2026-05-08T03:16:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=yDKawwfJ5O. [11] Huang, W., Zeng, Y ., Wang, Q., Fang, Z., Cao, S., Chu, Z., Yin, Q., Chen, S., Yin, Z., Chen, L., et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026. [12] Hurst, A., Lerer, A., Goucher, A. 
P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. [13] Jiang, D., Zhang, R., Guo, Z., Wu, Y ., Lei, J., Qiu, P., Lu, P., Chen, Z., Fu, C., Song, G., et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines."},{"citing_arxiv_id":"2604.17308","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2026-04-19T07:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"doi: 10.18653/v1/2025.emnlp-main.1355. URL https://aclanthology.org/ 2025.emnlp-main.1355/. [16] Shiting Huang, Zecheng Li, Yu Zeng, Qingnan Ren, Zhen Fang, Qisheng Su, Kou Shi, Lin Chen, Zehui Chen, and Feng Zhao. Internalizing meta-experience into memory for guided reinforcement learning in large language models.arXiv preprint arXiv:2602.10224, 2026. [17] Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026. 
[18] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik"},{"citing_arxiv_id":"2604.14029","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1: Overview of POINTS-Seeker. Our model adaptively interacts with external tools to solve complex, multi-hop VQA tasks. By leveraging V-Fold compression, stale history is rendered into compact visual tokens, effectively bypassing long-context performance degradation while achieving superior efficiency and reasoning fidelity. Currently, the prevailing paradigm [17,32,49,61] for developing multimodal search agents predominantly relies on the post-training of general-purpose LMMs (e.g., Qwen3-VL), typically through specialized supervised fine-tuning (SFT) or reinforcement learning (RL). 
While these methods effectively teach models to follow tool-calling protocols, they often treat agentic behavior as a superficial \"task layer\" rather than a fundamental cognitive substrate."},{"citing_arxiv_id":"2604.12890","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Long-horizon Agentic Multimodal Search","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:40:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Qwen3-VL-30B- A3B-Thinking Direct Answer 7.1 2.7 13.0 17.7 w. Previous Framework10.713.6 - 53.2 w. Our Framework 9.814.4 16.0 62.0 Table 3: Performance comparison among different frameworks. • Multimodal Search Agents. We compare against existing open-source multimodal agents, in- cluding MMSearch-R1 [61], WebWatcher [6], DeepEyesV2 [21], Vision-DeepResearch [7], and REDSearcher-MM [8]. These methods typically combine perception, reasoning, and external search to address complex multimodal queries. Implementation Details.We build our framework based on MiroFlow [ 62], and utilize it for both trajectory rollout and answer verification. We utilize LLaMA-Factory [63] as the training framework. 
We train the model for 3 epochs, with a global batch size of 64 and a learning rate of 1e-5."},{"citing_arxiv_id":"2603.28767","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-03-30T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}