pith. machine review for the scientific record.

arxiv: 2501.00321 · v2 · pith:OS4GEAZFnew · submitted 2024-12-31 · 💻 cs.CV · cs.AI

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Pith reviewed 2026-05-17 20:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords OCRBench v2 · Large Multimodal Models · OCR evaluation · Text localization · Visual reasoning · Benchmark · Multimodal AI

The pith

A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OCRBench v2, a larger and more demanding bilingual evaluation set for how well large multimodal models (LMMs) read and reason about text inside images. It covers four times as many tasks as the original OCRBench, spans 31 real-world scenarios, and supplies 10,000 human-verified question-answer pairs along with a private test set of 1,500 manually annotated images. When state-of-the-art models are run on the benchmark, most score below 50 out of 100 and show consistent weaknesses in five areas: less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning over the extracted text. The consistent trends across the public and private sets suggest the benchmark reliably surfaces these gaps rather than measuring only easy cases.

Core claim

OCRBench v2 provides the widest coverage yet for text-centric visual understanding, with 31 scenarios, thorough metrics, 10,000 verified pairs, and a private test set; benchmarking reveals that current LMMs generally score below 50 out of 100 and exhibit five recurring limitations in less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning.

What carries the argument

OCRBench v2, the expanded benchmark consisting of diverse scenarios, human-verified QA pairs, and separate public and private test sets used to measure LMM performance on text localization and reasoning.

If this is right

  • Models require targeted gains in recognizing uncommon or handwritten text.
  • Better spatial and layout understanding is needed to parse document structure.
  • Reasoning capabilities must advance to connect information across text elements.
  • Fine-grained visual detail extraction remains a bottleneck for complex scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines could prioritize data that stresses the five identified weaknesses to accelerate progress.
  • Real-world systems for document processing or scene text analysis may still need supplementary rule-based components.
  • Extending the benchmark to additional scripts or domains could expose further model gaps not visible in the current 31 scenarios.

Load-bearing premise

The chosen 31 scenarios and 10,000 question-answer pairs together with the private test set give an unbiased picture of model limits without selection effects that favor particular failure modes.

What would settle it

A new model that scores above 70 on both the public and private sets while showing none of the five listed limitations would contradict the claim that most current LMMs suffer from those specific weaknesses.
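The settling condition above can be written as an explicit check. The following is a minimal sketch, not part of the paper or of any OCRBench v2 tooling: the `SplitResult` type, the limitation keys, and the function name are hypothetical, and the reading of "showing none of the five limitations" as a per-limitation score bar is an assumption made only for illustration. Only the overall bar of 70 comes from the text.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical keys for the five limitation areas named in the paper.
LIMITATIONS = (
    "less_frequent_text_recognition",
    "fine_grained_perception",
    "layout_perception",
    "complex_element_parsing",
    "logical_reasoning",
)


@dataclass
class SplitResult:
    """Scores (0-100) for one model on one split (public or private)."""
    overall: float
    by_limitation: Dict[str, float]


def contradicts_claim(
    public: SplitResult,
    private: SplitResult,
    overall_bar: float = 70.0,      # "above 70" from the settling condition
    limitation_bar: float = 70.0,   # assumption: bar for "no limitation shown"
) -> bool:
    """Return True if a model meets the settling condition on both splits."""
    for split in (public, private):
        # The overall score must be strictly above the bar.
        if split.overall <= overall_bar:
            return False
        # Every limitation area must reach its bar; missing areas count as 0.
        for name in LIMITATIONS:
            if split.by_limitation.get(name, 0.0) < limitation_bar:
                return False
    return True
```

A result only counts if it holds on both splits, which mirrors the page's emphasis on agreement between public and private data. How to operationalize "no limitation" is a design choice that the paper does not specify.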

Original abstract

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces OCRBench v2, a large-scale bilingual benchmark for evaluating large multimodal models (LMMs) on visual text localization and reasoning. It expands prior work with 4x more tasks than OCRBench, 31 diverse scenarios, 10,000 human-verified QA pairs featuring a high proportion of difficult samples, and a private test set of 1,500 manually annotated images. The authors benchmark state-of-the-art LMMs, reporting that most models score below 50/100 and exhibit five specific limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. Reliability is supported by consistent performance trends between the public and private test sets.

Significance. If the benchmark construction and human verification hold, this work provides a valuable, more comprehensive tool for diagnosing OCR weaknesses in LMMs beyond basic text recognition. The explicit human verification of 10,000 pairs and the private test set with consistent trends are clear strengths that enhance reproducibility and reduce selection bias concerns. The findings can usefully direct future research toward the five identified limitation categories.
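One way a reader might quantify "consistent trends" between the public and private sets is a rank correlation over per-model scores. The sketch below computes Spearman's rank correlation in plain Python. The paper does not state that it uses this statistic, and the model names and scores are made-up placeholders, not results from the paper.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all ranks are tied")
    return cov / (vx * vy) ** 0.5


# Placeholder scores for illustration only (not taken from the paper).
public: Dict[str, float] = {"model_a": 47.0, "model_b": 39.5, "model_c": 52.1, "model_d": 31.0}
private: Dict[str, float] = {"model_a": 44.2, "model_b": 37.0, "model_c": 49.8, "model_d": 33.5}

models = sorted(public)
rho = spearman([public[m] for m in models], [private[m] for m in models])
```

A value of `rho` near 1 would indicate that models are ordered the same way on both splits. It says nothing about whether absolute scores match, so it complements a direct comparison of score gaps rather than replacing it.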

minor comments (2)
  1. [Abstract] The claim of 'thorough evaluation metrics' and 'high proportion of difficult samples' would benefit from one additional sentence summarizing the scoring rubric (e.g., exact criteria for partial credit on localization or reasoning tasks) and the selection process for difficult samples.
  2. [Benchmark Construction] The manuscript would be strengthened by a short table or paragraph in the benchmark construction section explicitly mapping the 31 scenarios to the five limitation types, to make the categorization less implicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the benchmark's strengths in scale, human verification, and private test set, and their recommendation to accept the manuscript. No major comments were raised that require point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity


The paper introduces OCRBench v2 as a new benchmark built from freshly collected images, 31 scenarios, and 10,000 human-verified QA pairs plus a private test set. Its central claims consist of empirical performance numbers and observed limitation categories obtained by running existing LMMs on this dataset. No equations, first-principles derivations, fitted parameters, or self-citation chains are used to generate the reported scores or limitations; the results are direct measurements on independently annotated data and therefore do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on new human-annotated data collection and scenario selection rather than mathematical derivations, free parameters, or postulated entities.

axioms (1)
  • domain assumption: Human verification produces accurate ground-truth labels for the 10,000 QA pairs.
    The reliability claim depends on the quality of the human annotation process described in the abstract.

pith-pipeline@v0.9.0 · 5618 in / 1209 out tokens · 44320 ms · 2026-05-17T20:29:10.364789+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

    cs.CV 2026-05 conditional novelty 8.0

    PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

  2. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV 2026-05 unverdicted novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

  3. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  4. ParseBench: A Document Parsing Benchmark for AI Agents

    cs.CV 2026-04 accept novelty 7.0

    ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.

  5. From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.

  6. FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

    cs.CV 2025-11 unverdicted novelty 7.0

    FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

  7. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  8. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  9. CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

    cs.CL 2026-05 unverdicted novelty 6.0

    CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

  10. Discovering Failure Modes in Vision-Language Models using RL

    cs.CV 2026-04 unverdicted novelty 6.0

    An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.

  11. MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

    cs.CV 2025-09 unverdicted novelty 6.0

    MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

  12. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  13. Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion

    cs.CV 2026-04 unverdicted novelty 5.0

    CPIFNet decomposes non-homogeneous dehazing into multiple homogeneous sub-problems via specialized IENet branches trained on different haze concentrations, then uses IFNet to fuse advantageous regions through deep fea...

  14. Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    FPFNet reports state-of-the-art AUROC scores on MVTec-AD and VisA for unified multi-class defect detection by adding feature perturbation and hierarchical fusion to UniAD with no extra parameters.

  15. Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

    cs.CL 2026-04 unverdicted novelty 5.0

    Linear probing reveals a gap between internal representations and responses in LVLMs for visual document understanding, with task information encoded more linearly in intermediate layers than the final layer, and fine...

  16. Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction

    cs.CV 2026-04 unverdicted novelty 5.0

    A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduc...

  17. Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

    cs.CV 2026-04 unverdicted novelty 3.0

    Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...

Reference graph

Works this paper leans on

156 extracted references · 156 canonical work pages · cited by 17 Pith papers · 29 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  3. [3]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in Neural Information Processing Systems, 2020

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, 2023

  5. [5]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in Neural Information Processing Systems, vol. 36, 2024

  6. [6]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models,

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”Proceedings of the International Con- ference on Learning Representations, 2024

  7. [7]

    arXiv preprint arXiv:2403.04473 (2024) 1, 3, 4, 9

    Y . Liu, B. Yang, Q. Liu, Z. Li, Z. Ma, S. Zhang, and X. Bai, “Textmonkey: An ocr-free large multimodal model for understanding document,”arXiv preprint arXiv:2403.04473, 2024

  8. [8]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sunet al., “MME: A comprehensive evaluation benchmark for multimodal large language models,”arXiv preprint arXiv:2306.13394, 2023

  9. [9]

    Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi,

    K. Ying, F. Meng, J. Wang, Z. Li, H. Lin, Y . Yang, H. Zhang, W. Zhang, Y . Lin, S. Liuet al., “Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi,”arXiv preprint arXiv:2404.16006, 2024

  10. [10]

    Towards vqa models that can read,

    A. Singh, V . Natarajan, M. Shah, Y . Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326

  11. [11]

    Scene text visual question answering,

    A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, and D. Karatzas, “Scene text visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4291–4301. 10

  12. [12]

    On the general value of evidence, and bilingual scene-text visual question answering,

    X. Wang, Y . Liu, C. Shen, C. C. Ng, C. Luo, L. Jin, C. S. Chan, A. v. d. Hengel, and L. Wang, “On the general value of evidence, and bilingual scene-text visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 126–10 135

  13. [13]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    L. Chen, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, J. Wang, Y . Qiao, D. Linet al., “Are We on the Right Way for Evaluating Large Vision-Language Models?”arXiv preprint arXiv:2403.20330, 2024

  14. [14]

    OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

    Y . Liu, Z. Li, B. Yang, C. Li, X. Yin, C.-l. Liu, L. Jin, and X. Bai, “On the hidden mystery of ocr in large multimodal models,”arXiv preprint arXiv:2305.07895, 2023

  15. [15]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension,

    B. Li, Y . Ge, Y . Chen, Y . Ge, R. Zhang, and Y . Shan, “Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension,” arXiv preprint arXiv:2404.16790, 2024

  16. [16]

    ConTextual: Evaluating Context- Sensitive Text-Rich Visual Reasoning in Large Multimodal Models,

    R. Wadhawan, H. Bansal, K.-W. Chang, and N. Peng, “ConTextual: Evaluating Context- Sensitive Text-Rich Visual Reasoning in Large Multimodal Models,” inProceedings of Inter- national Conference on Machine Learning, 2024

  17. [17]

    arXiv preprint arXiv:2405.14295 (2024) 4, 8, 9, 10 17

    C. Liu, H. Wei, J. Chen, L. Kong, Z. Ge, Z. Zhu, L. Zhao, J. Sun, C. Han, and X. Zhang, “Focus Anywhere for Fine-grained Multi-page Document Understanding,” arXiv preprint arXiv:2405.14295, 2024

  18. [18]

    Atsushi Kojima

    Y . Kim, M. Yim, and K. Y . Song, “TableVQA-Bench: A visual question answering benchmark on multiple table domains,”arXiv preprint arXiv:2404.19205, 2024

  19. [19]

    InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2131–2153, Singapore

    W. Zhao, H. Feng, Q. Liu, J. Tang, S. Wei, B. Wu, L. Liao, Y . Ye, H. Liu, H. Li et al., “TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy,”arXiv preprint arXiv:2406.01326, 2024

  20. [20]

    Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning,

    R. Xia, B. Zhang, H. Ye, X. Yan, Q. Liu, H. Zhou, Z. Chen, M. Dou, B. Shi, J. Yan et al., “Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning,”arXiv preprint arXiv:2402.12185, 2024

  21. [21]

    Qwen2.5-vl,

    Q. Team, “Qwen2.5-vl,” January 2025. [Online]. Available: https://qwenlm.github.io/blog/ qwen2.5-vl/

  22. [22]

    Docvqa: A dataset for vqa on document images,

    M. Mathew, D. Karatzas, and C. Jawahar, “Docvqa: A dataset for vqa on document images,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 2200–2209

  23. [23]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Z. Chen, W. Wang, Y . Cao, Y . Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,”arXiv preprint arXiv:2412.05271, 2024

  24. [24]

    Hello GPT-4o,

    OpenAI, “Hello GPT-4o,” https://openai.com/index/gpt-4v-system-card, 2024, accessed: 2024- 12-29

  25. [25]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,

    L. Ouyang, Y . Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao et al., “Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,”arXiv preprint arXiv:2412.07626, 2024

  26. [26]

    Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy,

    Z. Yang, J. Tang, Z. Li, P. Wang, J. Wan, H. Zhong, X. Liu, M. Yang, P. Wang, Y . Liuet al., “Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy,”arXiv preprint arXiv:2412.02210, 2024

  27. [27]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations

    Y . Ma, Y . Zang, L. Chen, M. Chen, Y . Jiao, X. Li, X. Lu, Z. Liu, Y . Ma, X. Dong et al., “Mmlongbench-doc: Benchmarking long-context document understanding with visualizations,” arXiv preprint arXiv:2407.01523, 2024. 11

  28. [28]

    Multimodal Table Understanding,

    M. Zheng, X. Feng, Q. Si, Q. She, Z. Lin, W. Jiang, and W. Wang, “Multimodal Table Understanding,” in Proceedings of Annual Meeting of the Association for Computational Linguistics , L. Ku, A. Martins, and V . Srikumar, Eds. Association for Computational Linguistics, 2024, pp. 9102–9124. [Online]. Available: https: //doi.org/10.18653/v1/2024.acl-long.493

  29. [29]

    MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning,

    F. Liu, X. Wang, W. Yao, J. Chen, K. Song, S. Cho, Y . Yacoob, and D. Yu, “MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024, pp. 1287–1310

  30. [30]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding,

    Y . Zhang, R. Zhang, J. Gu, Y . Zhou, N. Lipka, D. Yang, and T. Sun, “Llavar: Enhanced visual instruction tuning for text-rich image understanding,”arXiv preprint arXiv:2306.17107, 2023

  31. [31]

    InEMNLP (Find- ings)

    J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, Y . Dan, C. Zhao, G. Xu, C. Li, J. Tian et al., “mplug- docowl: Modularized multimodal large language model for document understanding,” arXiv preprint arXiv:2307.02499, 2023

  32. [32]

    Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding,

    H. Feng, Q. Liu, H. Liu, W. Zhou, H. Li, and C. Huang, “Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding,”arXiv preprint arXiv:2311.11810, 2023

  33. [33]

    Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model.arXiv preprint arXiv:2310.05126, 2023

    J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhanget al., “Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model,”arXiv preprint arXiv:2310.05126, 2023

  34. [34]

    LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,

    C. Luo, Y . Shen, Z. Zhu, Q. Zheng, Z. Yu, and C. Yao, “LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 630–15 640

  35. [35]

    arXiv preprint arXiv:2403.12895 (2024) 9, 10

    A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, C. Li, J. Zhang, Q. Jin, F. Huang et al., “mplug-docowl 1.5: Unified structure learning for ocr-free document understanding,”arXiv preprint arXiv:2403.12895, 2024

  36. [36]

    Dockylin: A large multimodal model for visual document understanding with efficient visual slimming,

    J. Zhang, W. Yang, S. Lai, Z. Xie, and L. Jin, “Dockylin: A large multimodal model for visual document understanding with efficient visual slimming,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9923–9932

  37. [37]

    Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding,

    W. Liao, J. Wang, H. Li, C. Wang, J. Huang, and L. Jin, “Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding,”arXiv preprint arXiv:2408.15045, 2024

  38. [38]

    A simple yet effective layout token in large language models for document understanding,

    Z. Zhu, C. Luo, Z. Shao, F. Gao, H. Xing, Q. Zheng, and J. Zhang, “A simple yet effective layout token in large language models for document understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  39. [39]

    Adaptive markup language generation for contextually- grounded visual document understanding,

    H. Xiao, Y . Xie, G. Tan, Y . Chen, R. Hu, K. Wang, A. Zhou, H. Li, H. Shao, X. Lu, P. Gao, Y . Wen, X. Chen, S. Ren, and H. Li, “Adaptive markup language generation for contextually- grounded visual document understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  40. [40]

    Marten: Visual question answering with mask generation for multi-modal document under- standing,

    Z. Wang, T. Guan, P. Fu, C. Duan, Q. Jiang, Z. Guo, S. Guo, J. Luo, W. Shen, and X. Yang, “Marten: Visual question answering with mask generation for multi-modal document under- standing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  41. [41]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “Chartqa: A benchmark for question answering about charts with visual and logical reasoning,”arXiv preprint arXiv:2203.10244, 2022

  42. [42]

    Infographicvqa,

    M. Mathew, V . Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar, “Infographicvqa,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2022, pp. 1697–1706. 12

  43. [43]

    Exploring the Capabilities of Large Multimodal Models on Dense Text,

    S. Zhang, B. Yang, Z. Li, Z. Ma, Y . Liu, and X. Bai, “Exploring the Capabilities of Large Multimodal Models on Dense Text,” inProceedings of International Conference on Document Analysis and Recognition. Springer, 2024, pp. 281–298

  44. [44]

    Onechart: Purify the chart structural extraction via one auxiliary token,

    J. Chen, L. Kong, H. Wei, C. Liu, Z. Ge, L. Zhao, J. Sun, C. Han, and X. Zhang, “Onechart: Purify the chart structural extraction via one auxiliary token,” in Proceedings of the ACM International Conference on Multimedia, 2024, pp. 147–155

  45. [45]

    Document understanding dataset and evaluation (dude),

    J. Van Landeghem, R. Tito, Ł. Borchmann, M. Pietruszka, P. Joziak, R. Powalski, D. Ju- rkiewicz, M. Coustaty, B. Anckaert, E. Valvenyet al., “Document understanding dataset and evaluation (dude),” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 528–19 540

  46. [46]

    Needle in a multimodal haystack,

    W. Wang, S. Zhang, Y . Ren, Y . Duan, T. Li, S. Liu, M. Hu, Z. Chen, K. Zhang, L. Luet al., “Needle in a multimodal haystack,” Advances in Neural Information Processing Systems , vol. 37, pp. 20 540–20 565, 2025

  47. [47]

    Hierarchical multimodal transformers for multipage docvqa,

    R. Tito, D. Karatzas, and E. Valveny, “Hierarchical multimodal transformers for multipage docvqa,”Pattern Recognition, vol. 144, p. 109834, 2023

  48. [48]

    Natural Language Engineering, 30(4):870–881

    C. Deng, J. Yuan, P. Bu, P. Wang, Z.-Z. Li, J. Xu, X.-H. Li, Y . Gao, J. Song, B. Zheng et al., “Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating,”arXiv preprint arXiv:2412.18424, 2024

  49. [49]

    Llava-next: Improved reasoning, ocr, and world knowledge,

    H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” 2024

  50. [50]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

  51. [51]

    Monkey: Image resolution and text label are important things for large multi-modal models,

    Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y . Sun, Y . Liu, and X. Bai, “Monkey: Image resolution and text label are important things for large multi-modal models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 763– 26 773

  52. [52]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    M. Deitke, C. Clark, S. Lee, R. Tripathi, Y . Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini et al., “Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models,”arXiv preprint arXiv:2409.17146, 2024

  53. [53]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms,

    S. Tong, E. L. Brown II, P. Wu, S. Woo, A. J. IYER, S. C. Akula, S. Yang, J. Yang, M. Midde- pogu, Z. Wang et al., “Cambrian-1: A fully open, vision-centric exploration of multimodal llms,” inAdvances in Neural Information Processing Systems, 2024

  54. [54]

    Pixtral 12B

    P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. De Monicault, S. Garg, T. Gervet et al., “Pixtral 12b,”arXiv preprint arXiv:2410.07073, 2024

  55. [55]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y . Ma, C. Wu, B. Wang et al., “Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal under- standing,”arXiv preprint arXiv:2412.10302, 2024

  56. [56]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Y . Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He et al., “Minicpm-v: A gpt-4v level mllm on your phone,”arXiv preprint arXiv:2408.01800, 2024

  57. [57]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai et al., “ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools,”arXiv preprint arXiv:2406.12793, 2024

  58. [58]

    Ovis: Structural embedding alignment for multimodal large language model

    S. Lu, Y. Li, Q.-G. Chen, Z. Xu, W. Luo, K. Zhang, and H.-J. Ye, “Ovis: Structural embedding alignment for multimodal large language model,” arXiv preprint arXiv:2405.20797, 2024

  59. [59]

    GPT-4o mini: advancing cost-efficient intelligence

    OpenAI, “GPT-4o mini: advancing cost-efficient intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024, accessed: 2024-12-29

  60. [60]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  61. [61]

    Claude 3.5 Sonnet

    Anthropic, “Claude 3.5 Sonnet,” https://www.anthropic.com/news/claude-3-5-sonnet, 2024, accessed: 2024-12-29

  62. [62]

    Step-1V

    StepFun, “Step-1V,” https://www.stepfun.com/#step1v, 2024, accessed: 2024-12-29

  63. [63]

    Image-based table recognition: data, model, and evaluation

    X. Zhong, E. ShafieiBavani, and A. Jimeno Yepes, “Image-based table recognition: data, model, and evaluation,” in Proceedings of European Conference on Computer Vision. Springer, 2020, pp. 564–580

  64. [64]

    Bleu: a method for automatic evaluation of machine translation

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318

  65. [65]

    METEOR: An automatic metric for mt evaluation with improved correlation with human judgments

    S. Banerjee and A. Lavie, “METEOR: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72

  66. [66]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

  67. [67]

    An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition

    B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2016

  68. [68]

    Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition

    S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7098–7107

  69. [69]

    Aster: An attentional scene text recognizer with flexible rectification

    B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2035–2048, 2018

  70. [70]

    Master: Multi-aspect non-local network for scene text recognition

    N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, and X. Bai, “Master: Multi-aspect non-local network for scene text recognition,” Pattern Recognition, vol. 117, p. 107980, 2021

  71. [71]

    SVTR: scene text recognition with a single visual model

    Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and Y. Jiang, “SVTR: scene text recognition with a single visual model,” in Proceedings of the International Joint Conference on Artificial Intelligence, L. D. Raedt, Ed. ijcai.org, 2022, pp. 884–890. [Online]. Available: https://doi.org/10.24963/ijcai.2022/124

  72. [72]

    Abcnet: Real-time scene text spotting with adaptive bezier-curve network

    Y. Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, “Abcnet: Real-time scene text spotting with adaptive bezier-curve network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9809–9818

  73. [73]

    Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting

    Y. Liu, C. Shen, L. Jin, T. He, P. Chen, C. Liu, and H. Chen, “Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8048–8064, 2021

  74. [74]

    Text spotting transformers

    X. Zhang, Y. Su, S. Tripathi, and Z. Tu, “Text spotting transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9519–9528

  75. [75]

    Total-text: A comprehensive dataset for scene text detection and recognition

    C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in Proceedings of International Conference on Document Analysis and Recognition, vol. 1. IEEE, 2017, pp. 935–942

  76. [76]

    General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

    H. Wei, C. Liu, J. Chen, J. Wang, L. Kong, Y. Xu, Z. Ge, L. Zhao, J. Sun, Y. Peng et al., “General ocr theory: Towards ocr-2.0 via a unified end-to-end model,” arXiv preprint arXiv:2409.01704, 2024

  77. [77]

    Icdar 2013 robust reading competition

    D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. de las Heras, “Icdar 2013 robust reading competition,” in Proceedings of International Conference on Document Analysis and Recognition, 2013, pp. 1484–1493

  78. [78]

    End-to-end scene text recognition using tree-structured models

    C. Shi, C. Wang, B. Xiao, S. Gao, and J. Hu, “End-to-end scene text recognition using tree-structured models,” Pattern Recognition, vol. 47, pp. 2853–2866, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:30201169

  79. [79]

    Scene text recognition using higher order language priors

    A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using higher order language priors,” in British Machine Vision Conference, 2012

  80. [80]

    Icdar 2015 competition on robust reading

    D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., “Icdar 2015 competition on robust reading,” in Proceedings of International Conference on Document Analysis and Recognition. IEEE, 2015, pp. 1156–1160

Showing first 80 references.