pith. machine review for the scientific record.

arxiv: 2605.13034 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.IR

Recognition: 2 Lean theorem links

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:15 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords: ViDR · source visual evidence · multimodal deep research · evidence grounding · visual language models · report verifiability · MMR Bench+ · source figure integration

The pith

Treating source figures as verifiable evidence objects improves the quality and verifiability of multimodal deep research reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most current deep research systems rely on text evidence or generate their own charts, leaving real source images underused. ViDR builds a framework that treats source figures as first-class evidence objects that can be retrieved, filtered, interpreted, and verified. It refines noisy web images into reliable evidence atoms using context-aware filtering, outline-aware reranking, and visual language model analysis, then links claims to visuals through an evidence-indexed outline. Validation steps cut misplaced or hallucinated figures, and a new benchmark measures retrieval, placement, interpretation, and verifiability. If successful, this makes long-form AI reports more trustworthy by anchoring them in original visual sources rather than synthesized content.

Core claim

ViDR is a multimodal deep research framework that grounds long-form reports in source figures treated as retrievable, interpretable, routable, and verifiable evidence objects. It constructs an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, generates each section with section-specific evidence, and validates visual references to reduce hallucinated or misplaced figures. Experiments show improvements in overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines on MMR Bench+, a new benchmark for evaluating visual evidence use in deep research reports.

What carries the argument

The evidence-indexed outline that links claims to textual and visual evidence, supported by context-aware filtering, outline-aware reranking, and VLM-based visual analysis that converts noisy web images into reliable source-figure evidence atoms.
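To make that machinery concrete, here is a minimal Python sketch of how a pipeline of this shape could be organized. Every name in it (EvidenceAtom, context_filter, rerank_by_outline, refine, the word-overlap scoring) is an illustrative assumption, not ViDR's actual interface; the paper specifies the stages, not this code.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceAtom:
    """A source figure promoted to a first-class, checkable evidence object."""
    image_url: str             # where the figure itself was retrieved
    source_url: str            # page of origin, kept for verifiability
    caption: str               # caption or surrounding page text
    visual_features: str = ""  # what is explicitly visible (VLM output)
    deductive_fact: str = ""   # the claim the image supports (VLM output)
    rationale: str = ""        # why it counts as evidence (VLM output)
    sections: list = field(default_factory=list)  # outline nodes it routes to

def context_filter(candidate: dict) -> bool:
    """Stand-in context-aware filter: keep images with substantive context."""
    return len(candidate.get("caption", "")) > 20

def rerank_by_outline(candidates: list, outline: list) -> list:
    """Stand-in outline-aware reranker: word overlap with outline headings."""
    def score(c):
        words = set(c["caption"].lower().split())
        return max((len(words & set(h.lower().split())) for h in outline), default=0)
    return sorted(((c, score(c)) for c in candidates), key=lambda pair: -pair[1])

def refine(candidates: list, outline: list, min_score: int = 1) -> list:
    """Noisy web images -> evidence atoms: filter, rerank, keep survivors."""
    kept = [c for c in candidates if context_filter(c)]
    return [
        EvidenceAtom(c["image_url"], c["source_url"], c["caption"])
        for c, s in rerank_by_outline(kept, outline)
        if s >= min_score
    ]
```

The design point is that provenance (source_url) and interpretation travel together on every figure, which is what makes a downstream claim-to-figure link checkable at all.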

If this is right

  • Each claim in the report can be directly linked to specific source visuals for stronger evidential grounding.
  • Visual support for claims becomes more accurate because figures come from original sources rather than generated approximations.
  • Report verifiability increases through explicit, checkable connections between text sections and source figures.
  • Systems retain the option to generate analytical charts when no suitable source figure exists.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering and validation steps could extend to other evidence types such as data tables or video clips in research reports.
  • Evaluation of AI research tools may shift toward measuring accuracy of evidence citation rather than fluency alone.
  • Domain-specific tests in fields like scientific literature could check whether source-figure grounding reduces misinterpretation of data visuals.

Load-bearing premise

Context-aware filtering, outline-aware reranking, and VLM-based visual analysis can reliably turn noisy web images into accurate, non-hallucinated evidence atoms without introducing new errors that affect report claims.
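That premise can be probed in isolation. The prompt fragments surfaced as internal anchors in the reference graph below suggest the VLM analysis asks for visual features, a deductive fact, and a rationale per image; a hedged sketch of that step, with a generic vlm callable standing in for whatever model and prompt wording ViDR actually uses:

```python
import json

# Schema mirrors the prompt fragments in the reference graph's internal
# anchors; ViDR's exact prompt wording, model, and parsing are not public here.
ANALYSIS_PROMPT = (
    "Analyze the attached figure and return JSON with:\n"
    "- visual_features: what is explicitly visible in the image\n"
    "- deductive_fact: the factual or quantitative claim supported by the image\n"
    "- rationale: how the image acts as evidence for the broader topic"
)

def analyze_with_vlm(atom, vlm):
    """Fill an EvidenceAtom's interpretation fields (see the sketch above).

    `vlm` is any callable(prompt, image_url) -> str that returns JSON text.
    Unparsable analyses are dropped rather than passed on as evidence.
    """
    raw = vlm(ANALYSIS_PROMPT, atom.image_url)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    atom.visual_features = parsed.get("visual_features", "")
    atom.deductive_fact = parsed.get("deductive_fact", "")
    atom.rationale = parsed.get("rationale", "")
    return atom
```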

What would settle it

A generated report in which a referenced source figure is placed or interpreted in a way that directly contradicts the visual content of the actual image, producing a verifiable factual error traceable to the visual evidence step.
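One way to hunt for such a counterexample is a post-hoc audit that re-asks a VLM whether each figure supports the sentence citing it. The function below is hypothetical (verdict vocabulary, pairing format, and vlm signature are all assumed), but a single CONTRADICTED verdict on a published report would be exactly the settling evidence described above.

```python
VERIFY_PROMPT = (
    "Claim: {claim}\n"
    "Does the attached figure support this claim? Answer SUPPORTED, "
    "UNRELATED, or CONTRADICTED, then give one sentence of justification."
)

def audit_figure_references(claim_figure_pairs, vlm):
    """Re-check every claim-figure pair in a finished report.

    `claim_figure_pairs` is an iterable of (claim_text, EvidenceAtom);
    `vlm` is any callable(prompt, image_url) -> str. Each CONTRADICTED
    verdict is a verifiable factual error traceable to the visual step.
    """
    failures = []
    for claim, atom in claim_figure_pairs:
        verdict = vlm(VERIFY_PROMPT.format(claim=claim), atom.image_url)
        if verdict.strip().upper().startswith("CONTRADICTED"):
            failures.append((claim, atom.source_url, verdict))
    return failures
```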

Figures

Figures reproduced from arXiv: 2605.13034 by Baoqin Sun, Haiyang Shen, Peilun Jia, Sixiong Xie, Xiang Jing, Yun Ma, Zhuofan Shi.

Figure 1. Comparison of deep research report paradigms. Text-centric systems omit visual evidence, …

Figure 2. Overview of the ViDR pipeline: (A) multimodal research, (B) image enrichment, (C) evidence-indexed planning, and (D) grounded section-wise generation. The pipeline applies a conservative extraction-time filter f_pre that removes visually non-evidential web assets using metadata-level cues, producing a tractable candidate set for subsequent semantic and visual reasoning. This filter does not decide final report usage; final selec…

Figure 3. Domain distribution of the 160 research queries in MMR Bench+, spanning 16 domains.

Figure 4. An excerpt from a report generated by ViDR. The report interleaves retrieved source …
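The Figure 2 caption describes a conservative extraction-time filter f_pre that removes visually non-evidential web assets using only metadata-level cues. A plausible sketch of such a prefilter follows; the thresholds and cue list are assumptions, since the excerpt names the filter but not its rules.

```python
import re

# Filename/URL cues for assets that are page furniture, not evidence.
NON_EVIDENTIAL_CUES = re.compile(r"logo|icon|avatar|banner|sprite|advert", re.I)

def f_pre(asset: dict) -> bool:
    """Conservative metadata-level prefilter: True means keep as a candidate.

    Decides candidacy only, never final report usage, using cheap cues
    (pixel size, aspect ratio, URL hints); all thresholds are assumed.
    """
    w, h = asset.get("width", 0), asset.get("height", 0)
    if w < 100 or h < 100:
        return False  # tiny assets are usually UI chrome, not figures
    if max(w, h) > 6 * max(1, min(w, h)):
        return False  # extreme aspect ratios suggest banners or dividers
    if NON_EVIDENTIAL_CUES.search(asset.get("url", "")):
        return False
    return True
```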
read the original abstract

Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViDR, a multimodal deep research framework that grounds long-form reports in source figures by treating them as retrievable, interpretable evidence objects. It employs context-aware filtering, outline-aware reranking, and VLM-based visual analysis to refine noisy web images into evidence atoms, builds an evidence-indexed outline linking claims to textual and visual evidence, generates sections with section-specific evidence, and validates visual references to reduce hallucinations. The work also proposes MMR Bench+ for evaluating source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation, claiming experimental improvements in report quality, figure integration, and verifiability over commercial and open-source baselines.

Significance. If the experimental claims hold, the work would be significant for multimodal AI by shifting focus from generated charts to source visual evidence, potentially improving grounding and reducing hallucinations in long-form reports. The introduction of MMR Bench+ provides a useful new evaluation resource for visual evidence use. The pipeline's emphasis on routable and verifiable evidence atoms offers a concrete direction for future systems.

major comments (2)
  1. [Abstract] The central claim that 'Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines' is unsupported by any quantitative metrics, baseline details, ablation results, statistical significance tests, or error analysis. This absence is load-bearing because the improvements cannot be assessed or reproduced from the provided information.
  2. [§3 Pipeline, §4 Experiments] The description of how context-aware filtering, outline-aware reranking, and VLM-based analysis convert noisy web images into accurate evidence atoms lacks implementation specifics, pseudocode, or ablation studies. Without these, it is impossible to verify whether the pipeline reduces hallucinations or merely relocates errors, directly affecting the weakest assumption in the central claim.
minor comments (2)
  1. [Introduction] The term 'evidence atoms' is used repeatedly but never given a formal definition or illustrative example; adding one would improve clarity for readers.
  2. [Related Work] The discussion of prior multimodal retrieval systems could benefit from additional citations to recent VLM-based grounding papers to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We agree that the experimental claims in the abstract and the implementation details in the pipeline require stronger quantitative support and reproducibility elements to fully substantiate our contributions. We address each major comment below and commit to revisions that will incorporate the suggested enhancements.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines' is unsupported by any quantitative metrics, baseline details, ablation results, statistical significance tests, or error analysis. This absence is load-bearing because the improvements cannot be assessed or reproduced from the provided information.

    Authors: We acknowledge that the abstract presents a high-level summary of the experimental outcomes without embedding specific metrics. The full manuscript in Section 4 reports quantitative results, including human-rated report quality scores, source-figure integration precision/recall, verifiability rates (measured via reference validation accuracy), and comparisons against baselines such as commercial systems and open-source multimodal agents, along with statistical significance where applicable. To directly address the concern, we will revise the abstract to include key quantitative highlights (e.g., relative improvements in verifiability and integration) and add a concise results summary table. We will also expand the experiments section with explicit baseline configurations, full ablation tables, and error analysis. These changes will make the claims fully assessable and reproducible. revision: yes

  2. Referee: [§3 Pipeline, §4 Experiments] The description of how context-aware filtering, outline-aware reranking, and VLM-based analysis convert noisy web images into accurate evidence atoms lacks implementation specifics, pseudocode, or ablation studies. Without these, it is impossible to verify whether the pipeline reduces hallucinations or merely relocates errors, directly affecting the weakest assumption in the central claim.

    Authors: We agree that greater implementation transparency is needed to demonstrate the pipeline's effectiveness in reducing hallucinations. Section 3 currently outlines the three stages at a conceptual level; in the revision we will add detailed pseudocode for context-aware filtering (including scoring functions and thresholds), outline-aware reranking (with similarity metrics and reranking algorithm), and VLM-based visual analysis (prompt templates and output parsing). We will also insert ablation studies in Section 4 that quantify the contribution of each component to evidence accuracy, hallucination reduction, and overall report verifiability, plus an error analysis categorizing remaining failure modes. These additions will clarify that the pipeline improves grounding rather than relocating errors. revision: yes
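For readers wondering what the promised pseudocode might look like, one generic shape for outline-aware reranking is to embed each candidate's textual context and each outline node, then order candidates by their best similarity to any node. The sketch below uses cosine similarity and an unspecified embed callable; ViDR's actual metric, embedding model, and thresholds are not given in the available text.

```python
import numpy as np

def rerank_outline_aware(candidates, outline_nodes, embed):
    """Order image candidates by best cosine similarity to any outline node.

    `embed` is any callable(text) -> 1-D numpy array; the metric choice,
    embedding model, and cutoffs here are placeholders, not ViDR's settings.
    """
    node_vecs = [embed(node) for node in outline_nodes]

    def best_similarity(candidate):
        v = embed(candidate["caption"])
        return max(
            float(v @ u) / (float(np.linalg.norm(v) * np.linalg.norm(u)) + 1e-9)
            for u in node_vecs
        )

    return sorted(candidates, key=best_similarity, reverse=True)
```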

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical multimodal system ViDR whose components (context-aware filtering, outline-aware reranking, VLM-based visual analysis, evidence-indexed outlines) are presented as engineering choices rather than derived quantities. Claims of improvement rest on external baseline comparisons on MMR Bench+ rather than any internal equations, fitted parameters, or self-referential predictions. No derivation chain exists that reduces outputs to inputs by construction, and no self-citation load-bearing steps are identifiable from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that vision-language models can perform reliable context-aware filtering and visual analysis on web-sourced images to produce usable evidence atoms.

axioms (1)
  • domain assumption: Vision-language models can reliably interpret and filter noisy web images into accurate evidence atoms.
    Invoked in the refinement pipeline and section-specific evidence generation steps described in the abstract.
invented entities (1)
  • evidence atoms (no independent evidence)
    purpose: Modular, routable units of source visual evidence for report grounding
    New conceptual object introduced to treat figures as retrievable and verifiable items rather than optional illustrations.

pith-pipeline@v0.9.0 · 5550 in / 1289 out tokens · 37340 ms · 2026-05-14T19:15:11.787518+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 4 internal anchors

  1. [1]

    Try deep research and our new experimental model in gemini, your ai assistant

    Dave Citron. Try deep research and our new experimental model in gemini, your ai assistant. Google Blog (Gemini), December 2024. URL https://blog.google/products-and-platforms/products/gemini/google-gemini-deep-research/. Accessed: 2026-01-28

  2. [2]

    Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research

    João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, et al. Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research. arXiv preprint arXiv:2505.19253, 2025

  3. [3]

    Deepresearch bench: A comprehensive benchmark for deep research agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025

  4. [4]

    gpt-researcher

    Assaf Elovic. gpt-researcher. https://github.com/assafelovic/gpt-researcher. GitHub repository. Accessed: 2025-12-29

  5. [5]

    Webwatcher: Breaking new frontier of vision-language deep research agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025

  6. [6]

    Gemini deep research — your personal research assistant

    Google. Gemini deep research — your personal research assistant. https://gemini.google/overview/deep-research/, 2025. Accessed: 2025-12-29

  7. [7]

    Mmdeepresearch-bench: A benchmark for multimodal deep research agents

    Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, and Mi Zhang. Mmdeepresearch-bench: A benchmark for multimodal deep research agents, 2026. URL https://arxiv.org/abs/2601.12346

  8. [8]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Xu Tang, Yao Hu, Shaohui Lin, Philip Torr, Feng Zhao, and Wanli Ouyang. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models, 2026. URL https://arxiv.org/abs/2601.22060

  9. [9]

    Deep research agents: A systematic examination and roadmap

    Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap. arXiv preprint arXiv:2506.18096, 2025

  10. [10]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  11. [11]

    open_deep_research

    LangChain. open_deep_research. https://github.com/langchain-ai/open_deep_research. GitHub repository. Accessed: 2025-12-29

  12. [12]

    Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report

    Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report. arXiv preprint arXiv:2601.08536, 2026

  13. [13]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

  14. [14]

    Webthinker: Empowering large reasoning models with deep research capability

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025

  15. [15]

    Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research

    Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, et al. Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research. arXiv preprint arXiv:2509.13312, 2025

  16. [16]

    Evidfuse: Writing-time evidence learning for consistent text-chart data reporting

    Huanxiang Lin, Qianyue Wang, Jinwu Hu, Bailin Chen, Qing Du, and Mingkui Tan. Evidfuse: Writing-time evidence learning for consistent text-chart data reporting. arXiv preprint arXiv:2601.05487, 2026

  17. [17]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

  18. [18]

    Introducing deep research

    OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, February 2025. Published: 2025-02-02. Accessed: 2025-12-29

  19. [19]

    Deepscholar-bench: A live benchmark and automated evaluation for generative research synthesis

    Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, and Carlos Guestrin. Deepscholar-bench: A live benchmark and automated evaluation for generative research synthesis. arXiv preprint arXiv:2508.20033, 2025

  20. [20]

    Assisting in writing wikipedia-like articles from scratch with large language models

    Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, 2024

  21. [21]

    A tale of two graphs: Separating knowledge exploration from outline structure for open-ended deep research

    Zhuofan Shi, Ming Ma, Zekun Yao, Fangkai Yang, Jue Zhang, Dongge Han, Victor Rühle, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. A tale of two graphs: Separating knowledge exploration from outline structure for open-ended deep research. arXiv preprint arXiv:2602.13830, 2026

  22. [22]

    Webshaper: Agentically data synthesizing via information-seeking formalization

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061, 2025

  23. [23]

    Tongyi DeepResearch Technical Report

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025

  24. [24]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

  25. [25]

    Webdancer: Towards autonomous information seeking agency

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648, 2025

  26. [26]

    Beyond outlining: Heterogeneous recursive planning for adaptive long-form writing with language models

    Ruibin Xiong, Yimeng Chen, Dmitrii Khizbullin, Mingchen Zhuge, and Jürgen Schmidhuber. Beyond outlining: Heterogeneous recursive planning for adaptive long-form writing with language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24689–24725, 2025

  27. [27]

    Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework

    Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Luoxuan Weng, Yingchaojie Feng, Haozhe Feng, Minfeng Zhu, Bo Zhang, et al. Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34368–34377, 2026

  28. [28]

    Miroeval: Benchmarking multimodal deep research agents in process and outcome

    Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, et al. Miroeval: Benchmarking multimodal deep research agents in process and outcome. arXiv preprint arXiv:2603.28407, 2026

  29. [29]

    Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

    Fangda Ye, Zhifei Xie, Yuxin Hu, Yihang Yin, Shurui Huang, Shikai Dong, Jianzhu Bao, and Shuicheng Yan. Deep-reporter: Deep research for grounded multimodal long-form generation. arXiv preprint arXiv:2604.10741, 2026

  30. [30]

    Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models

    Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, and Shaosheng Cao. Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models, 2026. URL https://arxiv.org/abs/2602.02185

  31. [31]

    Deep research: A survey of autonomous research agents

    Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025

  32.–36. [internal anchors]

    Entries 32–36 are internal anchors rather than external works: fragments of ViDR's own prompt templates pulled in by reference extraction. The recoverable content: the VLM image-analysis schema requests Visual Features (what is explicitly visible in the image), a Deductive Fact (the factual or quantitative claim supported by the image), a Rationale (how the image acts as evidence for the broader topic), an image_id, and up to {question_num} follow-up questions for further research; a Stage A prompt covers adaptive outline update and query guidance ("You are an expert research planner. Maintain a living …"); and report-refinement rules instruct the model to improve structure, expression, analytical presentation, readability, evidential support, precision, credibility, and verifiability, to use only information already present in the report, supplied learnings, and media inventory, to not invent facts, data, sources, or claims, to preserve every [[MEDIA_ANCHOR_xxx]] token exactly once, and to keep links, source names, citations, and footnote-style references …

    Improve evidential support, precision, credibility, and verifiability. Important rules: - Use only information already present in the report, supplied learnings, and media inventory. Do not invent facts, data, sources, or claims. - Preserve every [[MEDIA_ANCHOR_xxx]] token exactly once. - Keep links, source names, citations, footnote-style references, and...