Pith · machine review for the scientific record

arXiv:2604.21277 · v2 · submitted 2026-04-23 · 💻 cs.AI

Recognition: unknown

Can MLLMs "Read" What is Missing?

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords: MLLMs · text reconstruction · multimodal benchmarks · document understanding · visual grounding · masked text recovery

The pith

Without explicit prompts, multimodal language models struggle to reconstruct masked text from the visual layout of documents

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMTR-Bench to test whether multimodal large language models can recover masked text using only visual context from documents and webpages. By removing explicit instructions, the benchmark focuses on the models' built-in abilities to understand layouts and ground text visually. Results indicate that current models find this task difficult, particularly when reconstructing full sentences or paragraphs across different languages and document types. A sympathetic reader would care because it highlights gaps in how these models integrate vision and language for real-world document processing.

Core claim

MMTR-Bench evaluates the intrinsic ability of MLLMs to reconstruct masked text directly from visual context in single- or multi-page inputs from real-world domains. It eliminates explicit prompts to isolate layout understanding, visual grounding, and knowledge integration. The benchmark includes 2,771 test samples in multiple languages with varying lengths, using a level-aware evaluation protocol. Experiments demonstrate that the task poses a significant challenge to representative MLLMs, especially for sentence- and paragraph-level reconstruction.
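
The level-aware protocol itself is not spelled out in the material above. As a rough illustration of what length-conditioned scoring could look like, the sketch below buckets targets by word count and scores short spans by exact match and longer spans by string similarity; the bucket boundaries, function names, and metrics are assumptions for illustration, not MMTR-Bench's actual protocol, which may instead rely on semantic weighting or an LLM judge.

```python
# Hypothetical sketch of a level-aware scorer: exact match for short targets,
# softer string similarity for sentence- and paragraph-length targets.
# Thresholds and metrics are illustrative guesses, not MMTR-Bench's protocol.
from difflib import SequenceMatcher

def level_of(target: str) -> str:
    """Bucket a masked target by word count (boundaries are assumptions)."""
    n_words = len(target.split())
    if n_words <= 3:
        return "word"
    if n_words <= 25:
        return "sentence"
    return "paragraph"

def score(prediction: str, target: str) -> float:
    """Score a reconstruction according to the level of its target."""
    if level_of(target) == "word":
        # Short spans: case-insensitive exact match.
        return float(prediction.strip().lower() == target.strip().lower())
    # Longer spans: character-level similarity as a crude stand-in for the
    # semantic, judge-based scoring a real protocol might use.
    return SequenceMatcher(None, prediction.lower(), target.lower()).ratio()

print(score("HATCH", "hatch"))  # 1.0: word-level exact match
```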

What carries the argument

MMTR-Bench, a prompt-free benchmark for text reconstruction from visual document inputs that assesses layout understanding and visual grounding.
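
Taken literally, "prompt-free" means the model is handed only the masked page image(s) and no task instruction. A minimal harness under that reading might look like the following; `query_model`, the JSONL layout, and the field names are all assumptions, since the real sample schema lives on the benchmark homepage.

```python
# Minimal sketch of a prompt-free evaluation loop. `query_model` is a
# placeholder for whatever MLLM API is under test; the sample fields
# (context image paths, masked target) are assumed, not the benchmark schema.
import json

def query_model(images: list[str], text: str = "") -> str:
    """Placeholder: send images (and an empty instruction) to an MLLM."""
    raise NotImplementedError("wire up the model under test here")

def evaluate(samples_path: str) -> list[dict]:
    results = []
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            # No instruction text: the model sees only the masked page(s).
            prediction = query_model(sample["context_images"], text="")
            results.append({
                "id": sample["id"],
                "prediction": prediction,
                "target": sample["masked_text"],
            })
    return results
```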

If this is right

  • Performance drops notably for sentence- and paragraph-length reconstructions compared to shorter spans.
  • The task remains difficult across multiple languages and real-world domains like documents and webpages.
  • Models show limited success on multi-page inputs where visual context spans several pages.
  • The level-aware evaluation accounts for target length diversity to measure reconstruction at different scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future training of MLLMs could add reconstruction objectives to strengthen visual-text integration.
  • The benchmark might extend to testing other cases of incomplete visual or textual information.
  • Document analysis tools could improve by addressing the layout and grounding weaknesses identified here.

Load-bearing premise

Removing explicit prompts successfully isolates the model's intrinsic layout understanding, visual grounding, and knowledge integration from its instruction-following abilities.

What would settle it

Observing high reconstruction accuracy on paragraph-level masked texts without prompts would falsify the claim that MLLMs face significant challenges in this intrinsic ability.
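
In practice that falsifier reduces to an aggregate check over per-level scores, something like the sketch below; the 0.8 cutoff for "high accuracy" is an arbitrary placeholder, not a number from the paper, and the record layout is assumed.

```python
# Sketch: average scores per level and test the paragraph-level falsifier.
# The 0.8 "high accuracy" cutoff is a placeholder, not from the paper.
from collections import defaultdict

def mean_by_level(results: list[dict]) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["level"]].append(r["score"])
    return {level: sum(v) / len(v) for level, v in buckets.items()}

results = [{"level": "word", "score": 0.9}, {"level": "paragraph", "score": 0.2}]
means = mean_by_level(results)
print(means, "claim falsified:", means.get("paragraph", 0.0) >= 0.8)
```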

Figures

Figures reproduced from arXiv: 2604.21277 by Chaozheng Huang, Jindi Guo, Xi Fang.

Figure 1: Overall performance of representative models on MMTR-Bench. Models from the same…
Figure 1 (continued): Our results show that MMTR-Bench is still highly challenging. Stronger closed-source…
Figure 2: The pipeline consists of four main stages: (1) Data Preparation, where valuable text is…
Figure 3: Overview statistics of MMTR-Bench, including level distribution, single-page versus…
Figure 4: Compact overview of model behavior on MMTR-Bench, including difficulty-level trends,…
Figure 5: Model performance breakdown across semantic categories, layout elements, background…
Figure 6: A challenging case of MMTR-Bench. The ground truth for the masked region is "HATCH", a protruding structure on the roof of a grain-storage schematic whose lower sections are explicitly labeled "BIN", "HOT AIR", and "COOL AIR".
Figure 7: High-scoring Case 1.
Figure 8: High-scoring Case 2. A webpage screenshot testing the model's world knowledge regarding gaming; the model can infer the answer from other category tags or by reasoning through the Xbox console news already visible in the image.
Figure 9: High-scoring Case 3. An information-rich illustration from an academic publication, containing only the image and its title; all models reasoned correctly on this sample, showing that even models with 8B parameters possess a certain proficiency in paper reading.
Figure 10: High-scoring Case 4.
Figure 11: High-scoring Case 5. Tests image analysis and world knowledge: the model must locate the position of tooth #2 and reason from that placement, or infer the answer by identifying which tooth type is missing from the existing set.
Figure 12: High-scoring Case 7.
Figure 13: High-scoring Case 8.
Figure 14: A representative failure case from MMTR-Bench. The masked target is…
Figure 15: A representative failure case from MMTR-Bench in a UI browsing scenario. The masked…
Figure 16: A representative failure case from MMTR-Bench in an educational slide containing…
Figure 17: A representative failure case from MMTR-Bench in a scientific geologic map. The masked…
Figure 18: A representative failure case from MMTR-Bench in a research pipeline figure. The masked…
Figure 19: A comparison case for thinking and non-thinking variants. The masked target is…
Figure 20: A comparison case for thinking and non-thinking variants on a wikiHow-style infographic.
Figure 21: A comparison case for thinking and non-thinking variants on a scientific figure embedded…
Figure 22: A comparison case for thinking and non-thinking variants on a document page containing…
Figures 23–54: Cases 1–32 from MMTR-Bench, one case per figure in order.
Figure 55: Language distribution of MMTR-Bench.
Figure 56: Fine-grained source distribution of MMTR-Bench.
Figure 57: Histogram of the number of context images per sample.
Figure 58: Relationship between mask ratio and target character length.
Figure 59: Heatmap of difficulty level versus semantic category.
Figure 60: Heatmap of input mode versus layout element.
Figure 61: Heatmap of input mode versus semantic category.
Figure 62: Per-sample score distributions for top-performing models.
Figure 63: Difficulty-level profile for top-performing models.
Figure 64: Performance under single versus multi context.
Figure 65: Performance gain from multi context.
Figure 66: Full heatmap over semantic categories.
Figure 67: Full heatmap over layout elements.
Figure 68: Full heatmap over context scope.
Figure 69: Full heatmap over background complexity.
Figure 70: Full heatmap over text density.
Figure 71: Full heatmap over fine-grained source types.
Original abstract

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MMTR-Bench, a benchmark of 2,771 samples designed to test MLLMs' intrinsic ability to reconstruct masked text from visual context in single- or multi-page documents and webpages. By omitting explicit prompts, the design aims to isolate layout understanding, visual grounding, and knowledge integration from instruction-following. The benchmark spans multiple languages and reconstruction lengths, employs a level-aware evaluation protocol, and reports that experiments on representative MLLMs show the task is especially challenging at sentence- and paragraph-level.

Significance. If the no-prompt protocol validly isolates the targeted capabilities and the empirical results prove robust, MMTR-Bench would supply a useful new evaluation axis for MLLMs that moves beyond standard QA formats. The level-aware protocol and multi-lingual/multi-length coverage are concrete strengths that address real diversity in reconstruction difficulty.

major comments (2)
  1. [Abstract] The claim that 'experiments on representative MLLMs show that the benchmark poses a significant challenge' is supported by no quantitative metrics, baselines, error analysis, or model-specific scores, leaving the central empirical assertion without visible support.
  2. [Method] No-prompt design: the assumption that removing explicit prompts cleanly separates intrinsic layout understanding, visual grounding, and knowledge integration from instruction-following is load-bearing yet untested. No analysis is given of whether models produce targeted reconstructions, generic captions, refusals, or simply ignore the masked region, an issue most acute for sentence- and paragraph-level items.
minor comments (1)
  1. [Abstract] The provided homepage URL is useful, but the manuscript does not state dataset licensing, access method, or construction reproducibility details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'experiments on representative MLLMs show that the benchmark poses a significant challenge' is supported by no quantitative metrics, baselines, error analysis, or model-specific scores, leaving the central empirical assertion without visible support.

    Authors: We agree that the abstract would be strengthened by including concrete metrics. In the revised manuscript we will expand the abstract to report key quantitative results, including average reconstruction accuracy across models, the performance gap between word-level and paragraph-level items, and a brief mention of the strongest baseline. (Revision: yes.)

  2. Referee: [Method] No-prompt design: the assumption that removing explicit prompts cleanly separates intrinsic layout understanding, visual grounding, and knowledge integration from instruction-following is load-bearing yet untested. No analysis is given of whether models produce targeted reconstructions, generic captions, refusals, or simply ignore the masked region, an issue most acute for sentence- and paragraph-level items.

    Authors: We acknowledge the concern. The no-prompt protocol is intended to reduce instruction-following confounds, but we did not previously characterize output types. In the revision we add a dedicated qualitative analysis subsection that samples model generations at each length, showing that models predominantly attempt contextually grounded text rather than generic captions or refusals, while still failing to recover the masked content at longer spans. This supports the design while making the assumption more transparent. (Revision: partial.)
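
The output-type audit the authors describe could be approximated with a coarse tagger over raw generations; the marker lists below are illustrative assumptions, not the authors' method, and a real analysis would more likely use human annotation or an LLM judge.

```python
# Coarse, heuristic tagger for model generations: reconstruction attempt,
# generic caption, or refusal. Keyword lists are illustrative assumptions.
from collections import Counter

REFUSAL_MARKERS = ("i cannot", "i'm unable", "cannot determine", "sorry")
CAPTION_MARKERS = ("the image shows", "this is a picture", "a screenshot of")

def tag_output(generation: str) -> str:
    g = generation.strip().lower()
    if any(m in g for m in REFUSAL_MARKERS):
        return "refusal"
    if any(m in g for m in CAPTION_MARKERS):
        return "generic_caption"
    return "reconstruction_attempt"

outputs = ["HATCH", "The image shows a barn schematic.", "I cannot determine the text."]
print(Counter(tag_output(o) for o in outputs))
# Counter({'reconstruction_attempt': 1, 'generic_caption': 1, 'refusal': 1})
```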

Circularity Check

0 steps flagged

No circularity in empirical benchmark paper

Full rationale

This is an empirical benchmark introduction paper with no derivations, equations, fitted parameters, or mathematical claims. The core contribution is the definition and application of MMTR-Bench to test MLLMs on masked text reconstruction from images, with results reported from direct experiments on representative models. No step reduces a 'prediction' or 'first-principles result' to its own inputs by construction, and there are no self-citation chains or uniqueness theorems invoked to justify the methodology. The assumption that no-prompt evaluation isolates intrinsic abilities is a design choice open to empirical scrutiny rather than a circular definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5447 in / 944 out tokens · 27522 ms · 2026-05-09T21:34:31.917765+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Websrc: A dataset for web-based structural reading comprehension

    Xingyu Chen, Zihan Zhao, Lu Chen, JiaBao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, and Kai Yu. Websrc: A dataset for web-based structural reading comprehension. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4173–4185, 2021.

  2. [2]

    M-longdoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework

    Yew Ken Chia, Liying Cheng, Hou Pong Chan, Maojia Song, Chaoqun Liu, Mahani Aljunied, Soujanya Poria, and Lidong Bing. M-longdoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9244–9261, 2025.

  3. [3]

    Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating

    Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, et al. Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1135–…

  4. [4]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024.

  5. [5]

    End-to-end document understanding via chain-of-reading

    Jindi Guo, Chaozheng Huang, Haoyi Tao, Zhulin An, Guolin Ke, and Xi Fang. End-to-end document understanding via chain-of-reading

  6. [6]

    Layoutlmv3: Pre-training for document ai with unified text and image masking

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.

  7. [7]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.

  8. [8]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024.

  9. [9]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, 2022.

  10. [10]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.

  11. [11]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.

  12. [12]

    Unifying vision, text, and layout for universal document processing

    Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.

  13. [13]

    Document understanding dataset and evaluation (dude)

    Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. Document understanding dataset and evaluation (dude). In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19528–19540, 2023.

  14. [14]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  15. [15]

    Worldvqa: Measuring atomic world knowledge in multimodal large language models

    Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, et al. Worldvqa: Measuring atomic world knowledge in multimodal large language models. arXiv preprint arXiv:2602.02537, 2026.