pith. machine review for the scientific record.

arxiv: 2605.13034 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.IR

Recognition: 2 Lean theorem links

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:15 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords: ViDR · source visual evidence · multimodal deep research · evidence grounding · visual language models · report verifiability · MMR Bench+ · source figure integration

The pith

Treating source figures as verifiable evidence objects improves the quality and verifiability of multimodal deep research reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most current deep research systems rely on text evidence or generate their own charts, leaving real source images underused. ViDR builds a framework that treats source figures as first-class evidence objects that can be retrieved, filtered, interpreted, and verified. It refines noisy web images into reliable evidence atoms using context-aware filtering, outline-aware reranking, and visual language model analysis, then links claims to visuals through an evidence-indexed outline. Validation steps cut misplaced or hallucinated figures, and a new benchmark measures retrieval, placement, interpretation, and verifiability. If successful, this makes long-form AI reports more trustworthy by anchoring them in original visual sources rather than synthesized content.

Core claim

ViDR is a multimodal deep research framework that grounds long-form reports in source figures treated as retrievable, interpretable, routable, and verifiable evidence objects. It constructs an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, generates each section with section-specific evidence, and validates visual references to reduce hallucinated or misplaced figures. Experiments show improvements in overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines on MMR Bench+, a new benchmark for evaluating visual evidence use in deep research reports.

What carries the argument

The evidence-indexed outline that links claims to textual and visual evidence, supported by context-aware filtering, outline-aware reranking, and VLM-based visual analysis that converts noisy web images into reliable source-figure evidence atoms.
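To make that machinery concrete, here is a minimal Python sketch of how a pipeline of this shape could be organized. Every name in it (EvidenceAtom, context_filter, rerank_by_outline, refine, the word-overlap scoring) is an illustrative assumption, not ViDR's actual interface; the paper specifies the stages, not this code.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceAtom:
    """A source figure promoted to a first-class, checkable evidence object."""
    image_url: str             # where the figure itself was retrieved
    source_url: str            # page of origin, kept for verifiability
    caption: str               # caption or surrounding page text
    visual_features: str = ""  # what is explicitly visible (VLM output)
    deductive_fact: str = ""   # the claim the image supports (VLM output)
    rationale: str = ""        # why it counts as evidence (VLM output)
    sections: list = field(default_factory=list)  # outline nodes it routes to

def context_filter(candidate: dict) -> bool:
    """Stand-in context-aware filter: keep images with substantive context."""
    return len(candidate.get("caption", "")) > 20

def rerank_by_outline(candidates: list, outline: list) -> list:
    """Stand-in outline-aware reranker: word overlap with outline headings."""
    def score(c):
        words = set(c["caption"].lower().split())
        return max((len(words & set(h.lower().split())) for h in outline), default=0)
    return sorted(((c, score(c)) for c in candidates), key=lambda pair: -pair[1])

def refine(candidates: list, outline: list, min_score: int = 1) -> list:
    """Noisy web images -> evidence atoms: filter, rerank, keep survivors."""
    kept = [c for c in candidates if context_filter(c)]
    return [
        EvidenceAtom(c["image_url"], c["source_url"], c["caption"])
        for c, s in rerank_by_outline(kept, outline)
        if s >= min_score
    ]
```

The design point is that provenance (source_url) and interpretation travel together on every figure, which is what makes a downstream claim-to-figure link checkable at all.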

If this is right

  • Each claim in the report can be directly linked to specific source visuals for stronger evidential grounding.
  • Visual support for claims becomes more accurate because figures come from original sources rather than generated approximations.
  • Report verifiability increases through explicit, checkable connections between text sections and source figures.
  • Systems retain the option to generate analytical charts when no suitable source figure exists.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering and validation steps could extend to other evidence types such as data tables or video clips in research reports.
  • Evaluation of AI research tools may shift toward measuring accuracy of evidence citation rather than fluency alone.
  • Domain-specific tests in fields like scientific literature could check whether source-figure grounding reduces misinterpretation of data visuals.

Load-bearing premise

Context-aware filtering, outline-aware reranking, and VLM-based visual analysis can reliably turn noisy web images into accurate, non-hallucinated evidence atoms without introducing new errors that affect report claims.
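That premise can be probed in isolation. The prompt fragments surfaced as internal anchors in the reference graph below suggest the VLM analysis asks for visual features, a deductive fact, and a rationale per image; a hedged sketch of that step, with a generic vlm callable standing in for whatever model and prompt wording ViDR actually uses:

```python
import json

# Schema mirrors the prompt fragments in the reference graph's internal
# anchors; ViDR's exact prompt wording, model, and parsing are not public here.
ANALYSIS_PROMPT = (
    "Analyze the attached figure and return JSON with:\n"
    "- visual_features: what is explicitly visible in the image\n"
    "- deductive_fact: the factual or quantitative claim supported by the image\n"
    "- rationale: how the image acts as evidence for the broader topic"
)

def analyze_with_vlm(atom, vlm):
    """Fill an EvidenceAtom's interpretation fields (see the sketch above).

    `vlm` is any callable(prompt, image_url) -> str that returns JSON text.
    Unparsable analyses are dropped rather than passed on as evidence.
    """
    raw = vlm(ANALYSIS_PROMPT, atom.image_url)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    atom.visual_features = parsed.get("visual_features", "")
    atom.deductive_fact = parsed.get("deductive_fact", "")
    atom.rationale = parsed.get("rationale", "")
    return atom
```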

What would settle it

A generated report in which a referenced source figure is placed or interpreted in a way that directly contradicts the visual content of the actual image, producing a verifiable factual error traceable to the visual evidence step.
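One way to hunt for such a counterexample is a post-hoc audit that re-asks a VLM whether each figure supports the sentence citing it. The function below is hypothetical (verdict vocabulary, pairing format, and vlm signature are all assumed), but a single CONTRADICTED verdict on a published report would be exactly the settling evidence described above.

```python
VERIFY_PROMPT = (
    "Claim: {claim}\n"
    "Does the attached figure support this claim? Answer SUPPORTED, "
    "UNRELATED, or CONTRADICTED, then give one sentence of justification."
)

def audit_figure_references(claim_figure_pairs, vlm):
    """Re-check every claim-figure pair in a finished report.

    `claim_figure_pairs` is an iterable of (claim_text, EvidenceAtom);
    `vlm` is any callable(prompt, image_url) -> str. Each CONTRADICTED
    verdict is a verifiable factual error traceable to the visual step.
    """
    failures = []
    for claim, atom in claim_figure_pairs:
        verdict = vlm(VERIFY_PROMPT.format(claim=claim), atom.image_url)
        if verdict.strip().upper().startswith("CONTRADICTED"):
            failures.append((claim, atom.source_url, verdict))
    return failures
```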

Figures

Figures reproduced from arXiv: 2605.13034 by Baoqin Sun, Haiyang Shen, Peilun Jia, Sixiong Xie, Xiang Jing, Yun Ma, Zhuofan Shi.

Figure 1. Comparison of deep research report paradigms. Text-centric systems omit visual evidence, …

Figure 2. Overview of the ViDR pipeline: (A) multimodal research, (B) image enrichment, (C) evidence-indexed planning, and (D) grounded section-wise generation. The pipeline applies a conservative extraction-time filter f_pre that removes visually non-evidential web assets using metadata-level cues, producing a tractable candidate set for subsequent semantic and visual reasoning. This filter does not decide final report usage; final selec…

Figure 3. Domain distribution of the 160 research queries in MMR Bench+, spanning 16 domains.

Figure 4. An excerpt from a report generated by ViDR. The report interleaves retrieved source …
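The Figure 2 caption describes a conservative extraction-time filter f_pre that removes visually non-evidential web assets using only metadata-level cues. A plausible sketch of such a prefilter follows; the thresholds and cue list are assumptions, since the excerpt names the filter but not its rules.

```python
import re

# Filename/URL cues for assets that are page furniture, not evidence.
NON_EVIDENTIAL_CUES = re.compile(r"logo|icon|avatar|banner|sprite|advert", re.I)

def f_pre(asset: dict) -> bool:
    """Conservative metadata-level prefilter: True means keep as a candidate.

    Decides candidacy only, never final report usage, using cheap cues
    (pixel size, aspect ratio, URL hints); all thresholds are assumed.
    """
    w, h = asset.get("width", 0), asset.get("height", 0)
    if w < 100 or h < 100:
        return False  # tiny assets are usually UI chrome, not figures
    if max(w, h) > 6 * max(1, min(w, h)):
        return False  # extreme aspect ratios suggest banners or dividers
    if NON_EVIDENTIAL_CUES.search(asset.get("url", "")):
        return False
    return True
```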
read the original abstract

Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViDR, a multimodal deep research framework that grounds long-form reports in source figures by treating them as retrievable, interpretable evidence objects. It employs context-aware filtering, outline-aware reranking, and VLM-based visual analysis to refine noisy web images into evidence atoms, builds an evidence-indexed outline linking claims to textual and visual evidence, generates sections with section-specific evidence, and validates visual references to reduce hallucinations. The work also proposes MMR Bench+ for evaluating source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation, claiming experimental improvements in report quality, figure integration, and verifiability over commercial and open-source baselines.

Significance. If the experimental claims hold, the work would be significant for multimodal AI by shifting focus from generated charts to source visual evidence, potentially improving grounding and reducing hallucinations in long-form reports. The introduction of MMR Bench+ provides a useful new evaluation resource for visual evidence use. The pipeline's emphasis on routable and verifiable evidence atoms offers a concrete direction for future systems.

major comments (2)
  1. [Abstract] The central claim that 'Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines' is unsupported by any quantitative metrics, baseline details, ablation results, statistical significance tests, or error analysis. This absence is load-bearing because the improvements cannot be assessed or reproduced from the provided information.
  2. [§3 Pipeline, §4 Experiments] The description of how context-aware filtering, outline-aware reranking, and VLM-based analysis convert noisy web images into accurate evidence atoms lacks implementation specifics, pseudocode, or ablation studies. Without these, it is impossible to verify whether the pipeline reduces hallucinations or merely relocates errors, directly affecting the weakest assumption in the central claim.
minor comments (2)
  1. [Introduction] The term 'evidence atoms' is used repeatedly but never given a formal definition or illustrative example; adding one would improve clarity for readers.
  2. [Related Work] The discussion of prior multimodal retrieval systems could benefit from additional citations to recent VLM-based grounding papers to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We agree that the experimental claims in the abstract and the implementation details in the pipeline require stronger quantitative support and reproducibility elements to fully substantiate our contributions. We address each major comment below and commit to revisions that will incorporate the suggested enhancements.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines' is unsupported by any quantitative metrics, baseline details, ablation results, statistical significance tests, or error analysis. This absence is load-bearing because the improvements cannot be assessed or reproduced from the provided information.

    Authors: We acknowledge that the abstract presents a high-level summary of the experimental outcomes without embedding specific metrics. The full manuscript in Section 4 reports quantitative results, including human-rated report quality scores, source-figure integration precision/recall, verifiability rates (measured via reference validation accuracy), and comparisons against baselines such as commercial systems and open-source multimodal agents, along with statistical significance where applicable. To directly address the concern, we will revise the abstract to include key quantitative highlights (e.g., relative improvements in verifiability and integration) and add a concise results summary table. We will also expand the experiments section with explicit baseline configurations, full ablation tables, and error analysis. These changes will make the claims fully assessable and reproducible. revision: yes

  2. Referee: [§3 Pipeline, §4 Experiments] The description of how context-aware filtering, outline-aware reranking, and VLM-based analysis convert noisy web images into accurate evidence atoms lacks implementation specifics, pseudocode, or ablation studies. Without these, it is impossible to verify whether the pipeline reduces hallucinations or merely relocates errors, directly affecting the weakest assumption in the central claim.

    Authors: We agree that greater implementation transparency is needed to demonstrate the pipeline's effectiveness in reducing hallucinations. Section 3 currently outlines the three stages at a conceptual level; in the revision we will add detailed pseudocode for context-aware filtering (including scoring functions and thresholds), outline-aware reranking (with similarity metrics and reranking algorithm), and VLM-based visual analysis (prompt templates and output parsing). We will also insert ablation studies in Section 4 that quantify the contribution of each component to evidence accuracy, hallucination reduction, and overall report verifiability, plus an error analysis categorizing remaining failure modes. These additions will clarify that the pipeline improves grounding rather than relocating errors. revision: yes
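For readers wondering what the promised pseudocode might look like, one generic shape for outline-aware reranking is to embed each candidate's textual context and each outline node, then order candidates by their best similarity to any node. The sketch below uses cosine similarity and an unspecified embed callable; ViDR's actual metric, embedding model, and thresholds are not given in the available text.

```python
import numpy as np

def rerank_outline_aware(candidates, outline_nodes, embed):
    """Order image candidates by best cosine similarity to any outline node.

    `embed` is any callable(text) -> 1-D numpy array; the metric choice,
    embedding model, and cutoffs here are placeholders, not ViDR's settings.
    """
    node_vecs = [embed(node) for node in outline_nodes]

    def best_similarity(candidate):
        v = embed(candidate["caption"])
        return max(
            float(v @ u) / (float(np.linalg.norm(v) * np.linalg.norm(u)) + 1e-9)
            for u in node_vecs
        )

    return sorted(candidates, key=best_similarity, reverse=True)
```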

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical multimodal system ViDR whose components (context-aware filtering, outline-aware reranking, VLM-based visual analysis, evidence-indexed outlines) are presented as engineering choices rather than derived quantities. Claims of improvement rest on external baseline comparisons on MMR Bench+ rather than any internal equations, fitted parameters, or self-referential predictions. No derivation chain exists that reduces outputs to inputs by construction, and no self-citation load-bearing steps are identifiable from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that vision-language models can perform reliable context-aware filtering and visual analysis on web-sourced images to produce usable evidence atoms.

axioms (1)
  • domain assumption: Vision-language models can reliably interpret and filter noisy web images into accurate evidence atoms.
    Invoked in the refinement pipeline and section-specific evidence generation steps described in the abstract.
invented entities (1)
  • evidence atoms (no independent evidence)
    purpose: Modular, routable units of source visual evidence for report grounding
    New conceptual object introduced to treat figures as retrievable and verifiable items rather than optional illustrations.

pith-pipeline@v0.9.0 · 5550 in / 1289 out tokens · 37340 ms · 2026-05-14T19:15:11.787518+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 4 internal anchors

  1. [1]

    Try deep research and our new experimental model in gemini, your ai assistant

    Dave Citron. Try deep research and our new experimental model in gemini, your ai assistant. Google Blog (Gemini), December 2024. URL https://blog.google/products-and-platforms/products/gemini/google-gemini-deep-research/. Accessed: 2026-01-28

  2. [2]

    Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research

    João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, et al. Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research. arXiv preprint arXiv:2505.19253, 2025

  3. [3]

    Deepresearch bench: A comprehensive benchmark for deep research agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025

  4. [4]

    gpt-researcher

    Assaf Elovic. gpt-researcher. https://github.com/assafelovic/gpt-researcher. GitHub repository. Accessed: 2025-12-29

  5. [5]

    Webwatcher: Breaking new frontier of vision-language deep research agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025

  6. [6]

    Gemini deep research — your personal research assistant

    Google. Gemini deep research — your personal research assistant. https://gemini.google/overview/deep-research/, 2025. Accessed: 2025-12-29

  7. [7]

    Mmdeepresearch-bench: A benchmark for multimodal deep research agents

    Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, and Mi Zhang. Mmdeepresearch-bench: A benchmark for multimodal deep research agents, 2026. URL https://arxiv.org/abs/2601.12346

  8. [8]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Xu Tang, Yao Hu, Shaohui Lin, Philip Torr, Feng Zhao, and Wanli Ouyang. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models, 2026. URL https://arxiv.org/abs/2601.22060

  9. [9]

    Deep research agents: A systematic examination and roadmap

    Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap. arXiv preprint arXiv:2506.18096, 2025

  10. [10]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  11. [11]

    open_deep_research

    LangChain. open_deep_research. https://github.com/langchain-ai/open_deep_research. GitHub repository. Accessed: 2025-12-29

  12. [12]

    Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report

    Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report. arXiv preprint arXiv:2601.08536, 2026

  13. [13]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

  14. [14]

    Webthinker: Empowering large reasoning models with deep research capability

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025

  15. [15]

    Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research

    Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, et al. Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research. arXiv preprint arXiv:2509.13312, 2025

  16. [16]

    Evidfuse: Writing-time evidence learning for consistent text-chart data reporting

    Huanxiang Lin, Qianyue Wang, Jinwu Hu, Bailin Chen, Qing Du, and Mingkui Tan. Evidfuse: Writing-time evidence learning for consistent text-chart data reporting. arXiv preprint arXiv:2601.05487, 2026

  17. [17]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

  18. [18]

    Introducing deep research

    OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, February 2025. Published: 2025-02-02. Accessed: 2025-12-29

  19. [19]

    Deepscholar-bench: A live benchmark and automated evaluation for generative research synthesis

    Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, and Carlos Guestrin. Deepscholar-bench: A live benchmark and automated evaluation for generative research synthesis. arXiv preprint arXiv:2508.20033, 2025

  20. [20]

    Assisting in writing wikipedia-like articles from scratch with large language models

    Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, 2024

  21. [21]

    A tale of two graphs: Separating knowledge exploration from outline structure for open-ended deep research

    Zhuofan Shi, Ming Ma, Zekun Yao, Fangkai Yang, Jue Zhang, Dongge Han, Victor Rühle, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. A tale of two graphs: Separating knowledge exploration from outline structure for open-ended deep research. arXiv preprint arXiv:2602.13830, 2026

  22. [22]

    Webshaper: Agentically data synthesizing via information-seeking formalization

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061, 2025

  23. [23]

    Tongyi DeepResearch Technical Report

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025

  24. [24]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

  25. [25]

    Webdancer: Towards autonomous information seeking agency

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648, 2025

  26. [26]

    Beyond outlining: Heterogeneous recursive planning for adaptive long-form writing with language models

    Ruibin Xiong, Yimeng Chen, Dmitrii Khizbullin, Mingchen Zhuge, and Jürgen Schmidhuber. Beyond outlining: Heterogeneous recursive planning for adaptive long-form writing with language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24689–24725, 2025

  27. [27]

    Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework

    Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Luoxuan Weng, Yingchaojie Feng, Haozhe Feng, Minfeng Zhu, Bo Zhang, et al. Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34368–34377, 2026

  28. [28]

    Miroeval: Benchmarking multimodal deep research agents in process and outcome

    Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, et al. Miroeval: Benchmarking multimodal deep research agents in process and outcome. arXiv preprint arXiv:2603.28407, 2026

  29. [29]

    Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

    Fangda Ye, Zhifei Xie, Yuxin Hu, Yihang Yin, Shurui Huang, Shikai Dong, Jianzhu Bao, and Shuicheng Yan. Deep-reporter: Deep research for grounded multimodal long-form generation. arXiv preprint arXiv:2604.10741, 2026

  30. [30]

    Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models

    Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, and Shaosheng Cao. Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models, 2026. URL https://arxiv.org/abs/2602.02185

  31. [31]

    Deep research: A survey of autonomous research agents

    Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025

  32.–36. [internal anchors]

    Entries 32–36 are internal anchors rather than external works: fragments of ViDR's own prompt templates pulled in by reference extraction. The recoverable content: the VLM image-analysis schema requests Visual Features (what is explicitly visible in the image), a Deductive Fact (the factual or quantitative claim supported by the image), a Rationale (how the image acts as evidence for the broader topic), an image_id, and up to {question_num} follow-up questions for further research; a Stage A prompt covers adaptive outline update and query guidance ("You are an expert research planner. Maintain a living …"); and report-refinement rules instruct the model to improve structure, expression, analytical presentation, readability, evidential support, precision, credibility, and verifiability, to use only information already present in the report, supplied learnings, and media inventory, to not invent facts, data, sources, or claims, to preserve every [[MEDIA_ANCHOR_xxx]] token exactly once, and to keep links, source names, citations, and footnote-style references …

    Improve evidential support, precision, credibility, and verifiability. Important rules: - Use only information already present in the report, supplied learnings, and media inventory. Do not invent facts, data, sources, or claims. - Preserve every [[MEDIA_ANCHOR_xxx]] token exactly once. - Keep links, source names, citations, footnote-style references, and...