pith. sign in

arxiv: 2606.26602 · v1 · pith:ESM274CNnew · submitted 2026-06-25 · 💻 cs.CV

DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues

Pith reviewed 2026-06-26 05:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords DiCoBenchMLLMsfine-grained perceptionmulti-image benchmarkhigh-resolution imagesdifferential visual cuescommonality visual cuesvisual perception evaluation
0
0 comments X

The pith

DiCoBench reveals multimodal models lag far behind humans on autonomous high-resolution multi-image perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiCoBench to evaluate multimodal large language models on fine-grained perception of implicit visual cues across multiple high-resolution images. It contains 765 curated samples organized into Differential Visual Cues and Commonality Visual Cues tracks that cover eight tasks. When eighteen diverse MLLMs are tested in a multiple-choice format, they show large gaps from the 98.3 percent human accuracy level, especially on micro-scale details. Prior benchmarks are argued to have relied on explicit text or low-resolution inputs that masked these limitations. The new benchmark uses near-2K resolution images to force models to perceive cues without external guidance.

Core claim

DiCoBench consists of 765 high-resolution multi-image samples split into two progressive tracks of differential and commonality visual cues across eight perception tasks. Formulated as multiple-choice questions, it shows that eighteen evaluated MLLMs achieve substantially lower accuracy than humans at 98.3 percent, with top models struggling most on micro-scale detail capture. This gap is presented as evidence that current models cannot yet perform autonomous perception of implicit cues in high-resolution settings.

What carries the argument

DiCoBench benchmark with its Differential Visual Cues and Commonality Visual Cues tracks, using high-resolution inputs and multiple-choice questions to isolate autonomous cross-image perception.

If this is right

  • MLLMs require targeted advances to capture micro-scale details across multiple high-resolution images.
  • The benchmark supplies a standardized test for measuring progress in autonomous multi-image perception.
  • Future models will need mechanisms that compare implicit cues between images without external prompts.
  • Performance on DiCoBench can serve as an indicator of readiness for tasks demanding cross-image detail analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gaps may appear in single-image high-resolution settings if tested with the same implicit-cue constraints.
  • Training regimes that emphasize cross-image comparison without text hints could narrow the observed gap.
  • The two-track design implies that commonality detection and difference detection may expose distinct model weaknesses.

Load-bearing premise

The 765 samples were curated to test autonomous perception of implicit visual cues without textual or resolution biases, and the multiple-choice format fully removes evaluation metric bias.

What would settle it

Demonstrating that a model reaches near 98.3 percent accuracy on the benchmark or finding unintended textual cues in the samples that guide correct answers would undermine the reported performance gap.

Figures

Figures reproduced from arXiv: 2606.26602 by Geng Li, Yuxin Peng.

Figure 1
Figure 1. Figure 1: Overview of our proposed DiCoBench. (a) DiCoBench covers 2 major perception categories and 8 specific perception tasks. (b) We observe that the aver￾age resolution of existing multi-modal benchmarks remains primarily in low-resolution scenarios. In contrast, our proposed DiCoBench reaches approaching 2K. (c) Due to high-resolution constraints, the largest existing single-image fine-grained perception bench… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between DiCoBench and previous benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results on DiCoBench. The first and second rows illustrate examples of the four task types within the Differential Visual Cues Tasks and Com￾monality Visual Cues Tasks. Notably, the question for each task contains no explicit text cues. To answer correctly, models must actively perceive the visual cues represent￾ing differences or commonalities directly from the image pair. 3.1 Task Definition … view at source ↗
Figure 4
Figure 4. Figure 4: Accuracies of MLLMs on Di￾CoBench. Performance Disparities across Tasks: A comparison across task types reveals that the categoriza￾tion (Cat.) task within Commonality Tasks is a relative strength for cur￾rent models (with most scoring above 50%), whereas the reasoning (Rea.) task remains a universal “Achilles’ heel,” with most models hovering around 20% in [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Human accuracy on Di￾CoBench as a function of perceptual duration. The SOTA model Gemini￾3-Pro is shown for comparison [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Proportional distribution of error types for Gemini-3-Pro and Qwen-3-VL on DiCoBench. In this paper, we have presented DiCoBench, the first comprehensive benchmark tailored to evaluate high￾resolution, cross-image fine-grained perception in MLLMs. By synthesiz￾ing datasets that necessitate the de￾tection of implicit visual cues with￾out explicit textual guidance, we have demonstrated that current SOTA mode… view at source ↗
read the original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive fine-grained perception capabilities. However, existing benchmarks predominantly rely on explicit textual cues or low-resolution inputs, failing to evaluate a model's ability to autonomously perceive implicit visual cues in high-resolution. To bridge this gap, we introduce DiCoBench, a comprehensive, multi-image high-resolution benchmark designed for cross-image fine-grained perception. DiCoBench consists of 765 meticulously curated samples categorized into two progressive tracks: Differential Visual Cues and Commonality Visual Cues, covering 8 distinct perception tasks. By formulating the benchmark as a multiple-choice question task and utilizing high-resolution imagery (approaching 2K), we eliminate evaluation metric bias and pose a substantial challenge to current state-of-the-art MLLMs. Our extensive evaluation of 18 diverse MLLMs reveals a striking performance gap compared to human accuracy (98.3\%), with top-performing models struggling significantly with micro-scale detail capture. We believe DiCoBench will serve as a challenging testbed to drive future research in autonomous, high-resolution multi-image perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DiCoBench, a benchmark of 765 high-resolution (~2K) multi-image samples divided into Differential Visual Cues and Commonality Visual Cues tracks across 8 perception tasks. Formulated as multiple-choice questions, it evaluates 18 MLLMs and reports a large gap to human performance (98.3%), attributing the gap to models' difficulty with autonomous perception of implicit micro-scale visual cues.

Significance. If the curation process can be shown to avoid selection biases and textual/resolution artifacts, DiCoBench would offer a useful addition to existing MLLM benchmarks by targeting implicit cross-image cues at high resolution. The scale of the evaluation (18 models) and the explicit human baseline provide a concrete starting point for future work on fine-grained multi-image perception.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim of a 'striking performance gap' between models and the 98.3% human baseline depends on the 765 samples genuinely testing implicit cue perception without curation artifacts. However, the manuscript provides no description of source image selection criteria, filtering steps, annotation protocol, inter-annotator agreement, or controls against textual leakage and option imbalance.
  2. [§4] §4 (Experiments): No statistical tests, confidence intervals, or per-task error analysis are reported for the model results, and the criteria used to select the 18 MLLMs are not stated. These omissions make it impossible to assess whether the reported gap is robust or could be explained by sampling variance or model choice.
  3. [Human evaluation] Human evaluation paragraph: The 98.3% human accuracy figure is presented without the number of annotators, their expertise level, agreement statistics, or the exact protocol used to obtain the baseline, rendering the model-human comparison difficult to interpret.
minor comments (2)
  1. [§3] A summary table listing the 8 tasks, their distribution across the two tracks, and sample counts per task would improve readability of the benchmark description.
  2. [Abstract] The abstract states 'approaching 2K' resolution; the exact pixel dimensions or range used for the images should be stated explicitly in the benchmark section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on DiCoBench. We address each major comment point by point below. We will revise the manuscript to incorporate additional details and analyses where the comments identify omissions, thereby improving the transparency and rigor of the benchmark presentation.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim of a 'striking performance gap' between models and the 98.3% human baseline depends on the 765 samples genuinely testing implicit cue perception without curation artifacts. However, the manuscript provides no description of source image selection criteria, filtering steps, annotation protocol, inter-annotator agreement, or controls against textual leakage and option imbalance.

    Authors: We agree that explicit documentation of the curation process is required to substantiate the absence of artifacts. In the revised manuscript we will expand §3 with a dedicated subsection detailing source image selection criteria, filtering steps, the full annotation protocol, inter-annotator agreement statistics, and the specific controls applied against textual leakage and option imbalance. These additions will directly support the claim that the samples evaluate implicit cross-image cue perception. revision: yes

  2. Referee: [§4] §4 (Experiments): No statistical tests, confidence intervals, or per-task error analysis are reported for the model results, and the criteria used to select the 18 MLLMs are not stated. These omissions make it impossible to assess whether the reported gap is robust or could be explained by sampling variance or model choice.

    Authors: We concur that statistical rigor and transparency in model selection strengthen the evaluation. The revised §4 will include appropriate statistical tests, confidence intervals around reported accuracies, and a per-task error breakdown. We will also state the explicit selection criteria for the 18 MLLMs, emphasizing architectural diversity and coverage of leading open- and closed-source models. revision: yes

  3. Referee: [Human evaluation] Human evaluation paragraph: The 98.3% human accuracy figure is presented without the number of annotators, their expertise level, agreement statistics, or the exact protocol used to obtain the baseline, rendering the model-human comparison difficult to interpret.

    Authors: We acknowledge that additional information is needed to interpret the human baseline. In the revised manuscript we will expand the human-evaluation paragraph to report the number of annotators, their expertise level, inter-annotator agreement statistics, and the precise protocol followed to obtain the 98.3% figure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark without derivations or fitted predictions

full rationale

The paper introduces DiCoBench as a new dataset of 765 curated samples for evaluating MLLMs on multi-image perception tasks. The abstract and provided text contain no equations, parameter fitting, self-citations used as load-bearing premises, or any derivation chain. Claims about performance gaps (e.g., models vs. 98.3% human accuracy) are direct empirical results from running models on the benchmark, not quantities that reduce to the curation process by construction. No steps match the enumerated circularity patterns. This is a standard empirical contribution with independent content in the benchmark design and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark paper with no free parameters, invented entities, or non-standard axioms beyond typical CV evaluation assumptions.

axioms (1)
  • domain assumption High-resolution multi-image inputs and multiple-choice format are appropriate and unbiased for measuring autonomous fine-grained perception.
    Invoked to justify the benchmark design as eliminating textual and resolution biases.

pith-pipeline@v0.9.1-grok · 5724 in / 1166 out tokens · 23066 ms · 2026-06-26T05:15:38.398562+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 2 canonical work pages

  1. [1]

    org/abs/2509.23661

    An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., Wu, C., Tan, H., Li, C., Yang, J., Yu, J., Wang, X., Qin, B., Wang, Y., Yan, Z., Feng, Z., Liu, Z., Li, B., Deng, J.: Llava-onevision-1.5: Fully open framework for democratized multimodal training (2025),https://arxiv. org/abs/2509.23661

  2. [2]

    Advances in Neural Information Processing Systems37, 107795– 107829 (2024)

    Awal, R., Ahmadi, S., Zhang, L., Agrawal, A.: Vismin: Visual minimal-change understanding. Advances in Neural Information Processing Systems37, 107795– 107829 (2024)

  3. [3]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  4. [4]

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923

  5. [5]

    Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal modelswith open-source suites.ScienceChina InformationSciences67(12),220101 (2024)

  6. [6]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Di, Z., Shi, J., Fan, Y., Tan, H., Black, A., Collomosse, J., Liu, Y.: Difftell: A high-quality dataset for describing image manipulation changes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24580–24590 (2025)

  7. [7]

    arXiv preprint arXiv:2509.17040 (2025)

    Du, H., Zhang, J., Nan, G., Deng, W., Chen, Z., Zhang, C., Xiao, W., Huang, S., Pan, Y., Qi, T., Leng, S.: From easy to hard: The mir benchmark for progressive interleaved multi-image reasoning. arXiv preprint arXiv:2509.17040 (2025)

  8. [8]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11198–11201 (2024)

  9. [9]

    arXiv preprint arXiv:2404.12390 (2024)

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not per- ceive. arXiv preprint arXiv:2404.12390 (2024)

  10. [10]

    In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, 16 G

    Guo, Z., Sun, J., Wang, T.J.J., Radman, A., Pehlivan, S., Cao, M., Laaksonen, J.: Learning to describe implicit changes: Noise-robust pre-training for image dif- ference captioning. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, 16 G. Li and Y. Peng V. (eds.) Findings of the Association for Computational Linguistics: EMNLP

  11. [11]

    10125–10145

    pp. 10125–10145. Association for Computational Linguistics, Suzhou, China (Nov 2025).https://doi.org/10.18653/v1/2025.findings-emnlp.537,https: //aclanthology.org/2025.findings-emnlp.537/

  12. [12]

    Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model (2026),https://arxiv.org/abs/2511.05271

  13. [13]

    In: Proceedings of the Asian Conference on Computer Vision

    Hu, E., Guo, L., Yue, T., Zhao, Z., Xue, S., Liu, J.: Onediff: A generalist model for image difference captioning. In: Proceedings of the Asian Conference on Computer Vision. pp. 2439–2455 (2024)

  14. [14]

    arXiv preprint arXiv:1808.10584 (2018)

    Jhamtani, H., Berg-Kirkpatrick, T.: Learning to describe differences between pairs of similar images. arXiv preprint arXiv:1808.10584 (2018)

  15. [15]

    Advances in Neural Information Processing Systems37, 28798–28827 (2024)

    Kil, J., Mai, Z., Lee, J., Chowdhury, A., Wang, Z., Cheng, K., Wang, L., Liu, Y., Chao, W.L.H.: Mllm-compbench: A comparative reasoning benchmark for multi- modal llms. Advances in Neural Information Processing Systems37, 28798–28827 (2024)

  16. [16]

    arXiv preprint arXiv:2509.07969 (2025)

    Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969 (2025)

  17. [17]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Li, J., Wu, M., Jin, Z., Chen, H., Ji, J., Sun, X., Cao, L., Ji, R.: Mihbench: Bench- marking and mitigating multi-image hallucinations in multimodal large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3143–3152 (2025)

  18. [18]

    He Had Dangerous Delusions. ChatGPT Admitted It Made Them Worse

    Li, Y., Tu, Y., Li, L., Su, L., Huang, Q.: Change entity-guided heterogeneous repre- sentation disentangling for change captioning. In: Che, W., Nabende, J., Shutova, E.,Pilehvar,M.T.(eds.)FindingsoftheAssociationforComputationalLinguistics: ACL 2025. pp. 17050–17060. Association for Computational Linguistics, Vienna, Austria (Jul 2025).https://doi.org/10...

  19. [19]

    arXiv preprint arXiv:2501.05767 (2025)

    Li, Y., Huang, H., Chen, C., Huang, K., Huang, C., Guo, Z., Liu, Z., Xu, J., Li, Y., Li, R., et al.: Migician: Revealing the magic of free-form multi-image grounding in multimodal large language models. arXiv preprint arXiv:2501.05767 (2025)

  20. [20]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Liu, Y., Hou, S., Hou, S., Du, J., Meng, S., Huang, Y.: Omnidiff: A comprehensive benchmark for fine-grained image difference captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21440–21449 (2025)

  21. [21]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Park, D.H., Darrell, T., Rohrbach, A.: Robust change captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4624–4633 (2019)

  22. [22]

    arXiv preprint arXiv:2404.18532 (2024)

    Song, D., Chen, S., Chen, G.H., Yu, F., Wan, X., Wang, B.: Milebench: Bench- marking mllms in long context. arXiv preprint arXiv:2404.18532 (2024)

  23. [23]

    In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

    Tan, H., Dernoncourt, F., Lin, Z., Bui, T., Bansal, M.: Expressing visual relation- ships via language. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 1873–1883 (2019)

  24. [24]

    Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., bastien Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Cole...

  25. [25]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Tong,T.C.,He,S.,Shao,Z.,Yeung,D.Y.:G-veval:Aversatilemetricforevaluating image and video captions using gpt-4o. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 7419–7427 (2025)

  26. [26]

    arXiv preprint arXiv:2406.09411 (2024)

    Wang, F., Fu, X., Huang, J.Y., Li, Z., Liu, Q., Liu, X., Ma, M.D., Xu, N., Zhou, W., Zhang, K., et al.: Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411 (2024)

  27. [27]

    arXiv preprint arXiv:2507.07999 (2025)

    Wang, H., Li, X., Huang, Z., Wang, A., Wang, J., Zhang, T., Zheng, J., Bai, S., Kang, Z., Feng, J., et al.: Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999 (2025)

  28. [28]

    arXiv preprint arXiv:2409.12191 (2024)

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  29. [29]

    Li and Y

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...

  30. [30]

    In: AAAI

    Wang, W., Ding, L., Zeng, M., Zhou, X., Shen, L., Luo, Y., Yu, W., Tao, D.: Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In: AAAI. vol. 39, pp. 7907–7915 (2025)

  31. [31]

    spot the difference

    Wei, K., Hu, B., Cao, J., Chen, X., Lu, Z., Xia, W., Xu, W., Wu, J., He, J., Jia, M., et al.:m3−verse: A" spot the difference" challenge for large multimodal models. arXiv preprint arXiv:2512.18735 (2025)

  32. [32]

    In: CVPR

    Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. In: CVPR. pp. 13084–13094 (2024)

  33. [33]

    In: CVPR (2024)

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal under- standing and reasoning benchmark for expert agi. In: CVPR (2024)

  34. [34]

    In: ACM Multimedia 2024 (2024),https://openreview.net/forum?id=eiGs5VCsYM

    Zhang, X., Wen, H., Wu, J., Qin, P., Xue’, H., Nie, L.: Differential-perceptive and retrieval-augmented MLLM for change captioning. In: ACM Multimedia 2024 (2024),https://openreview.net/forum?id=eiGs5VCsYM

  35. [35]

    Zhang, Y.F., Lu, X., Yin, S., Fu, C., Chen, W., Hu, X., Wen, B., Jiang, K., Liu, C., Zhang, T., Fan, H., Chen, K., Chen, J., Ding, H., Tang, K., Zhang, Z., Wang, L., Yang, F., Gao, T., Zhou, G.: Thyme: Think beyond images (2025),https: //arxiv.org/abs/2508.11630

  36. [36]

    thinking with images

    Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing" thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)