arxiv: 2501.00321 · v2 · pith:OS4GEAZFnew · submitted 2024-12-31 · 💻 cs.CV · cs.AI

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai

Pith reviewed 2026-05-17 20:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords OCRBench v2 · Large Multimodal Models · OCR evaluation · Text localization · Visual reasoning · Benchmark · Multimodal AI

The pith

A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OCRBench v2 as a larger and more demanding evaluation set for how well large multimodal models read and reason about text inside images. It has four times as many tasks as the original OCRBench, spans 31 real-world scenarios, and supplies 10,000 human-verified question-answer pairs along with a private test set of 1,500 manually annotated images. When state-of-the-art models are run on the benchmark, most score below 50 out of 100 and show consistent weaknesses in five areas: less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The consistent trends across the public and private test sets suggest the benchmark reliably surfaces these gaps rather than measuring only easy cases.
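One way to make "consistent trends across public and private sets" concrete is a rank correlation over per-model scores. The sketch below is illustrative only: the summary here does not say which consistency measure the authors used, and the function names `ranks` and `spearman` and the score-table layout (model name mapped to a 0-100 score) are assumptions made for this example.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks, averaging the rank over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(public: Dict[str, float], private: Dict[str, float]) -> float:
    """Spearman rank correlation over the models present in both tables.

    Assumption: each table maps a model name to its overall score on one
    split. This is not confirmed to be the paper's own consistency check.
    """
    models = sorted(set(public) & set(private))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    rx = ranks([public[m] for m in models])
    ry = ranks([private[m] for m in models])
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores are constant on one split")
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 would mean the two splits order the models the same way; a value near 0 would undercut the reliability argument. Actual per-model scores would have to come from the paper's result tables.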

Core claim

OCRBench v2 provides the widest coverage yet for text-centric visual understanding, with 31 scenarios, thorough metrics, 10,000 verified pairs, and a private test set; benchmarking reveals that current LMMs generally score below 50 out of 100 and exhibit five recurring limitations in less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning.

What carries the argument

OCRBench v2, the expanded benchmark consisting of diverse scenarios, human-verified QA pairs, and separate public and private test sets used to measure LMM performance on text localization and reasoning.

If this is right

Models require targeted gains in recognizing uncommon or handwritten text.
Better spatial and layout understanding is needed to parse document structure.
Reasoning capabilities must advance to connect information across text elements.
Fine-grained visual detail extraction remains a bottleneck for complex scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines could prioritize data that stresses the five identified weaknesses to accelerate progress.
Real-world systems for document processing or scene text analysis may still need supplementary rule-based components.
Extending the benchmark to additional scripts or domains could expose further model gaps not visible in the current 31 scenarios.

Load-bearing premise

The chosen 31 scenarios and 10,000 question-answer pairs together with the private test set give an unbiased picture of model limits without selection effects that favor particular failure modes.

What would settle it

A new model that scores above 70 on both the public and private sets while showing none of the five listed limitations would contradict the claim that most current LMMs suffer from those specific weaknesses.
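The settling condition above can be written down as a small predicate. This is a minimal sketch under stated assumptions: `ModelResult`, `is_counterexample`, and the field names are hypothetical, and deciding which limitations a model "shows" is left to the evaluator, since the summary does not define that procedure. Only the threshold of 70 and the five limitation names come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet

# The five limitation types named in the paper's abstract.
FIVE_LIMITATIONS = frozenset({
    "less frequently encountered text recognition",
    "fine-grained perception",
    "layout perception",
    "complex element parsing",
    "logical reasoning",
})


@dataclass(frozen=True)
class ModelResult:
    """Hypothetical record of one model's benchmark outcome."""
    public_score: float   # overall score on the public set, 0-100
    private_score: float  # overall score on the private set, 0-100
    # Limitations an evaluator judged the model to exhibit (assumed input).
    observed_limitations: FrozenSet[str] = frozenset()


def is_counterexample(result: ModelResult, threshold: float = 70.0) -> bool:
    """True if the result meets the settling condition: above the threshold
    on both sets and none of the five listed limitations observed."""
    unknown = result.observed_limitations - FIVE_LIMITATIONS
    if unknown:
        raise ValueError(f"unrecognized limitation labels: {sorted(unknown)}")
    return (
        result.public_score > threshold
        and result.private_score > threshold
        and not result.observed_limitations
    )
```

Requiring both sets to clear the threshold matters: a high public score with a low private score would point to overfitting to the released data rather than to a genuine counterexample.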

Original abstract

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OCRBench v2 broadens coverage for LMM text evaluation with consistent verification but could clarify scoring and sample selection.

OCRBench v2 expands the testing ground for how large multimodal models handle text in images. The main point is that current models mostly score below 50 out of 100 and show weaknesses in five areas: rare text recognition, fine-grained perception, layout understanding, parsing complex elements, and logical reasoning.

The paper improves on the first OCRBench in several concrete ways. It includes four times as many tasks, covers 31 scenarios, supports two languages, and adds a private test set of 1,500 images. The 10,000 human-verified question-answer pairs and the consistent results between the public and private sets are positive signs for the benchmark's reliability.

There are a couple of softer areas. More information on the scoring rubrics and the criteria for picking difficult samples would strengthen the presentation. Without those details, it is harder to assess how well the benchmark isolates the claimed limitations.

This is the kind of paper that researchers in document AI and visual question answering will find useful. It gives a clear picture of where models need work and provides a new resource for measuring progress. Readers who follow benchmarks in multimodal learning should take a look at the results.

I recommend putting this through peer review. The added scope and the verification efforts make it worth the attention of referees, even if some methodological points could use clarification.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces OCRBench v2, a large-scale bilingual benchmark for evaluating large multimodal models (LMMs) on visual text localization and reasoning. It expands prior work with 4x more tasks than OCRBench, 31 diverse scenarios, 10,000 human-verified QA pairs featuring a high proportion of difficult samples, and a private test set of 1,500 manually annotated images. The authors benchmark state-of-the-art LMMs, reporting that most models score below 50/100 and exhibit five specific limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. Reliability is supported by consistent performance trends between the public and private test sets.

Significance. If the benchmark construction and human verification hold, this work provides a valuable, more comprehensive tool for diagnosing OCR weaknesses in LMMs beyond basic text recognition. The explicit human verification of 10,000 pairs and the private test set with consistent trends are clear strengths that enhance reproducibility and reduce selection bias concerns. The findings can usefully direct future research toward the five identified limitation categories.

minor comments (2)

[Abstract] The claim of 'thorough evaluation metrics' and 'high proportion of difficult samples' would benefit from one additional sentence summarizing the scoring rubric (e.g., exact criteria for partial credit on localization or reasoning tasks) and the selection process for difficult samples.
[Benchmark Construction] The manuscript would be strengthened by a short table or paragraph in the benchmark construction section explicitly mapping the 31 scenarios to the five limitation types, to make the categorization less implicit.
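To illustrate the kind of partial-credit criterion the first comment asks the authors to spell out, the sketch below computes intersection-over-union (IoU) between a predicted and a reference text box. IoU is a common localization measure, but whether OCRBench v2 scores localization this way is not stated in this summary; the `Box` layout and the function name `iou` are assumptions made for the example.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(pred: Box, gold: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(pred[0], gold[0])
    iy_min = max(pred[1], gold[1])
    ix_max = min(pred[2], gold[2])
    iy_max = min(pred[3], gold[3])
    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gold = max(0.0, gold[2] - gold[0]) * max(0.0, gold[3] - gold[1])
    union = area_pred + area_gold - inter
    return inter / union if union > 0 else 0.0
```

A rubric built on a measure like this would still need to state whether credit is the raw IoU or a pass/fail at some cutoff, which is the detail the comment is requesting.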

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the benchmark's strengths in scale, human verification, and private test set, and their recommendation to accept the manuscript. No major comments were raised that require point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces OCRBench v2 as a new benchmark built from freshly collected images, 31 scenarios, and 10,000 human-verified QA pairs plus a private test set. Its central claims consist of empirical performance numbers and observed limitation categories obtained by running existing LMMs on this dataset. No equations, first-principles derivations, fitted parameters, or self-citation chains are used to generate the reported scores or limitations; the results are direct measurements on independently annotated data and therefore do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on new human-annotated data collection and scenario selection rather than mathematical derivations, free parameters, or postulated entities.

axioms (1)

domain assumption: Human verification produces accurate ground-truth labels for the 10,000 QA pairs.
The reliability claim depends on the quality of the human annotation process described in the abstract.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
cs.CV · 2026-05 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
cs.CV · 2026-05 · unverdicted · novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL · 2026-04 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

ParseBench: A Document Parsing Benchmark for AI Agents
cs.CV · 2026-04 · accept · novelty 7.0

ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
cs.CV · 2026-03 · unverdicted · novelty 7.0

A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
cs.CV · 2025-11 · unverdicted · novelty 7.0

FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
cs.CV · 2026-05 · unverdicted · novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
cs.AI · 2026-05 · unverdicted · novelty 6.0

LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
cs.CL · 2026-05 · unverdicted · novelty 6.0

CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

Discovering Failure Modes in Vision-Language Models using RL
cs.CV · 2026-04 · unverdicted · novelty 6.0

An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
cs.CV · 2025-09 · unverdicted · novelty 6.0

MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV · 2026-05 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion
cs.CV · 2026-04 · unverdicted · novelty 5.0

CPIFNet decomposes non-homogeneous dehazing into multiple homogeneous sub-problems via specialized IENet branches trained on different haze concentrations, then uses IFNet to fuse advantageous regions through deep fea...

Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection
cs.CV · 2026-04 · unverdicted · novelty 5.0

FPFNet reports state-of-the-art AUROC scores on MVTec-AD and VisA for unified multi-class defect detection by adding feature perturbation and hierarchical fusion to UniAD with no extra parameters.

Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
cs.CL · 2026-04 · unverdicted · novelty 5.0

Linear probing reveals a gap between internal representations and responses in LVLMs for visual document understanding, with task information encoded more linearly in intermediate layers than the final layer, and fine...

Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
cs.CV · 2026-04 · unverdicted · novelty 5.0

A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduc...

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
cs.CV · 2026-04 · unverdicted · novelty 3.0

Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...

Reference graph

Works this paper leans on

156 extracted references · 156 canonical work pages · cited by 17 Pith papers · 29 internal anchors

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in Neural Information Processing Systems, 2020

work page 2020
[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024
[6]

Minigpt-4: Enhancing vision-language understanding with advanced large language models,

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”Proceedings of the International Con- ference on Learning Representations, 2024

work page 2024
[7]

arXiv preprint arXiv:2403.04473 (2024) 1, 3, 4, 9

Y . Liu, B. Yang, Q. Liu, Z. Li, Z. Ma, S. Zhang, and X. Bai, “Textmonkey: An ocr-free large multimodal model for understanding document,”arXiv preprint arXiv:2403.04473, 2024

work page arXiv 2024
[8]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sunet al., “MME: A comprehensive evaluation benchmark for multimodal large language models,”arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi,

K. Ying, F. Meng, J. Wang, Z. Li, H. Lin, Y . Yang, H. Zhang, W. Zhang, Y . Lin, S. Liuet al., “Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi,”arXiv preprint arXiv:2404.16006, 2024

work page arXiv 2024
[10]

Towards vqa models that can read,

A. Singh, V . Natarajan, M. Shah, Y . Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326

work page 2019
[11]

Scene text visual question answering,

A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, and D. Karatzas, “Scene text visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4291–4301. 10

work page 2019
[12]

On the general value of evidence, and bilingual scene-text visual question answering,

X. Wang, Y . Liu, C. Shen, C. C. Ng, C. Luo, L. Jin, C. S. Chan, A. v. d. Hengel, and L. Wang, “On the general value of evidence, and bilingual scene-text visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 126–10 135

work page 2020
[13]

Are We on the Right Way for Evaluating Large Vision-Language Models?

L. Chen, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, J. Wang, Y . Qiao, D. Linet al., “Are We on the Right Way for Evaluating Large Vision-Language Models?”arXiv preprint arXiv:2403.20330, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Y . Liu, Z. Li, B. Yang, C. Li, X. Yin, C.-l. Liu, L. Jin, and X. Bai, “On the hidden mystery of ocr in large multimodal models,”arXiv preprint arXiv:2305.07895, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension,

B. Li, Y . Ge, Y . Chen, Y . Ge, R. Zhang, and Y . Shan, “Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension,” arXiv preprint arXiv:2404.16790, 2024

work page arXiv 2024
[16]

ConTextual: Evaluating Context- Sensitive Text-Rich Visual Reasoning in Large Multimodal Models,

R. Wadhawan, H. Bansal, K.-W. Chang, and N. Peng, “ConTextual: Evaluating Context- Sensitive Text-Rich Visual Reasoning in Large Multimodal Models,” inProceedings of Inter- national Conference on Machine Learning, 2024

work page 2024
[17]

arXiv preprint arXiv:2405.14295 (2024) 4, 8, 9, 10 17

C. Liu, H. Wei, J. Chen, L. Kong, Z. Ge, Z. Zhu, L. Zhao, J. Sun, C. Han, and X. Zhang, “Focus Anywhere for Fine-grained Multi-page Document Understanding,” arXiv preprint arXiv:2405.14295, 2024

work page arXiv 2024
[18]

Atsushi Kojima

Y . Kim, M. Yim, and K. Y . Song, “TableVQA-Bench: A visual question answering benchmark on multiple table domains,”arXiv preprint arXiv:2404.19205, 2024

work page arXiv 2024
[19]

InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2131–2153, Singapore

W. Zhao, H. Feng, Q. Liu, J. Tang, S. Wei, B. Wu, L. Liao, Y . Ye, H. Liu, H. Li et al., “TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy,”arXiv preprint arXiv:2406.01326, 2024

work page arXiv 2024
[20]

Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning,

R. Xia, B. Zhang, H. Ye, X. Yan, Q. Liu, H. Zhou, Z. Chen, M. Dou, B. Shi, J. Yan et al., “Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning,”arXiv preprint arXiv:2402.12185, 2024

work page arXiv 2024
[21]

Qwen2.5-vl,

Q. Team, “Qwen2.5-vl,” January 2025. [Online]. Available: https://qwenlm.github.io/blog/ qwen2.5-vl/

work page 2025
[22]

Docvqa: A dataset for vqa on document images,

M. Mathew, D. Karatzas, and C. Jawahar, “Docvqa: A dataset for vqa on document images,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 2200–2209

work page 2021
[23]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Z. Chen, W. Wang, Y . Cao, Y . Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,”arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Hello GPT-4o,

OpenAI, “Hello GPT-4o,” https://openai.com/index/gpt-4v-system-card, 2024, accessed: 2024- 12-29

work page 2024
[25]

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,

L. Ouyang, Y . Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao et al., “Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,”arXiv preprint arXiv:2412.07626, 2024

work page arXiv 2024
[26]

Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy,

Z. Yang, J. Tang, Z. Li, P. Wang, J. Wan, H. Zhong, X. Liu, M. Yang, P. Wang, Y . Liuet al., “Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy,”arXiv preprint arXiv:2412.02210, 2024

work page arXiv 2024
[27]

Mmlongbench-doc: Benchmarking long-context document understanding with visualizations

Y . Ma, Y . Zang, L. Chen, M. Chen, Y . Jiao, X. Li, X. Lu, Z. Liu, Y . Ma, X. Dong et al., “Mmlongbench-doc: Benchmarking long-context document understanding with visualizations,” arXiv preprint arXiv:2407.01523, 2024. 11

work page arXiv 2024
[28]

Multimodal Table Understanding,

M. Zheng, X. Feng, Q. Si, Q. She, Z. Lin, W. Jiang, and W. Wang, “Multimodal Table Understanding,” in Proceedings of Annual Meeting of the Association for Computational Linguistics , L. Ku, A. Martins, and V . Srikumar, Eds. Association for Computational Linguistics, 2024, pp. 9102–9124. [Online]. Available: https: //doi.org/10.18653/v1/2024.acl-long.493

work page doi:10.18653/v1/2024.acl-long.493 2024
[29]

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning,

F. Liu, X. Wang, W. Yao, J. Chen, K. Song, S. Cho, Y . Yacoob, and D. Yu, “MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024, pp. 1287–1310

work page 2024
[30]

Llavar: Enhanced visual instruction tuning for text-rich image understanding,

Y . Zhang, R. Zhang, J. Gu, Y . Zhou, N. Lipka, D. Yang, and T. Sun, “Llavar: Enhanced visual instruction tuning for text-rich image understanding,”arXiv preprint arXiv:2306.17107, 2023

work page arXiv 2023
[31]

InEMNLP (Find- ings)

J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, Y . Dan, C. Zhao, G. Xu, C. Li, J. Tian et al., “mplug- docowl: Modularized multimodal large language model for document understanding,” arXiv preprint arXiv:2307.02499, 2023

work page arXiv 2023
[32]

Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding,

H. Feng, Q. Liu, H. Liu, W. Zhou, H. Li, and C. Huang, “Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding,”arXiv preprint arXiv:2311.11810, 2023

work page arXiv 2023
[33]

Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model.arXiv preprint arXiv:2310.05126, 2023

J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhanget al., “Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model,”arXiv preprint arXiv:2310.05126, 2023

work page arXiv 2023
[34]

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,

C. Luo, Y . Shen, Z. Zhu, Q. Zheng, Z. Yu, and C. Yao, “LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 630–15 640

work page 2024
[35]

arXiv preprint arXiv:2403.12895 (2024) 9, 10

A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, C. Li, J. Zhang, Q. Jin, F. Huang et al., “mplug-docowl 1.5: Unified structure learning for ocr-free document understanding,”arXiv preprint arXiv:2403.12895, 2024

work page arXiv 2024
[36]

Dockylin: A large multimodal model for visual document understanding with efficient visual slimming,

J. Zhang, W. Yang, S. Lai, Z. Xie, and L. Jin, “Dockylin: A large multimodal model for visual document understanding with efficient visual slimming,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9923–9932

work page 2025
[37]

Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding,

W. Liao, J. Wang, H. Li, C. Wang, J. Huang, and L. Jin, “Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding,”arXiv preprint arXiv:2408.15045, 2024

work page arXiv 2024
[38]

A simple yet effective layout token in large language models for document understanding,

Z. Zhu, C. Luo, Z. Shao, F. Gao, H. Xing, Q. Zheng, and J. Zhang, “A simple yet effective layout token in large language models for document understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[39]

Adaptive markup language generation for contextually- grounded visual document understanding,

H. Xiao, Y . Xie, G. Tan, Y . Chen, R. Hu, K. Wang, A. Zhou, H. Li, H. Shao, X. Lu, P. Gao, Y . Wen, X. Chen, S. Ren, and H. Li, “Adaptive markup language generation for contextually- grounded visual document understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[40]

Marten: Visual question answering with mask generation for multi-modal document under- standing,

Z. Wang, T. Guan, P. Fu, C. Duan, Q. Jiang, Z. Guo, S. Guo, J. Luo, W. Shen, and X. Yang, “Marten: Visual question answering with mask generation for multi-modal document under- standing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[41]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “Chartqa: A benchmark for question answering about charts with visual and logical reasoning,”arXiv preprint arXiv:2203.10244, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Infographicvqa,

M. Mathew, V . Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar, “Infographicvqa,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2022, pp. 1697–1706. 12

work page 2022
[43]

Exploring the Capabilities of Large Multimodal Models on Dense Text,

S. Zhang, B. Yang, Z. Li, Z. Ma, Y . Liu, and X. Bai, “Exploring the Capabilities of Large Multimodal Models on Dense Text,” inProceedings of International Conference on Document Analysis and Recognition. Springer, 2024, pp. 281–298

work page 2024
[44]

Onechart: Purify the chart structural extraction via one auxiliary token,

J. Chen, L. Kong, H. Wei, C. Liu, Z. Ge, L. Zhao, J. Sun, C. Han, and X. Zhang, “Onechart: Purify the chart structural extraction via one auxiliary token,” in Proceedings of the ACM International Conference on Multimedia, 2024, pp. 147–155

work page 2024
[45]

Document understanding dataset and evaluation (dude),

J. Van Landeghem, R. Tito, Ł. Borchmann, M. Pietruszka, P. Joziak, R. Powalski, D. Ju- rkiewicz, M. Coustaty, B. Anckaert, E. Valvenyet al., “Document understanding dataset and evaluation (dude),” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 528–19 540

work page 2023
[46]

Needle in a multimodal haystack,

W. Wang, S. Zhang, Y . Ren, Y . Duan, T. Li, S. Liu, M. Hu, Z. Chen, K. Zhang, L. Luet al., “Needle in a multimodal haystack,” Advances in Neural Information Processing Systems , vol. 37, pp. 20 540–20 565, 2025

work page 2025
[47]

Hierarchical multimodal transformers for multipage docvqa,

R. Tito, D. Karatzas, and E. Valveny, “Hierarchical multimodal transformers for multipage docvqa,”Pattern Recognition, vol. 144, p. 109834, 2023

work page 2023
[48]

Natural Language Engineering, 30(4):870–881

C. Deng, J. Yuan, P. Bu, P. Wang, Z.-Z. Li, J. Xu, X.-H. Li, Y . Gao, J. Song, B. Zheng et al., “Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating,”arXiv preprint arXiv:2412.18424, 2024

work page arXiv 2024
[49]

Llava-next: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” 2024

work page 2024
[50]

LLaVA-OneVision: Easy Visual Task Transfer

B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Monkey: Image resolution and text label are important things for large multi-modal models,

Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai, “Monkey: Image resolution and text label are important things for large multi-modal models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 763–26 773

work page 2024
[52]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini et al., “Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models,” arXiv preprint arXiv:2409.17146, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms,

S. Tong, E. L. Brown II, P. Wu, S. Woo, A. J. Iyer, S. C. Akula, S. Yang, J. Yang, M. Middepogu, Z. Wang et al., “Cambrian-1: A fully open, vision-centric exploration of multimodal llms,” in Advances in Neural Information Processing Systems, 2024

work page 2024
[54]

Pixtral 12B

P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. De Monicault, S. Garg, T. Gervet et al., “Pixtral 12b,” arXiv preprint arXiv:2410.07073, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang et al., “Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He et al., “Minicpm-v: A gpt-4v level mllm on your phone,” arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai et al., “ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools,” arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Ovis: Structural embedding alignment for multimodal large language model,

S. Lu, Y. Li, Q.-G. Chen, Z. Xu, W. Luo, K. Zhang, and H.-J. Ye, “Ovis: Structural embedding alignment for multimodal large language model,” arXiv preprint arXiv:2405.20797, 2024

work page arXiv 2024
[59]

GPT-4o mini: advancing cost-efficient intelligence,

OpenAI, “GPT-4o mini: advancing cost-efficient intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024, accessed: 2024-12-29

work page 2024
[60]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Claude 3.5 Sonnet,

Anthropic, “Claude 3.5 Sonnet,” https://www.anthropic.com/news/claude-3-5-sonnet, 2024, accessed: 2024-12-29

work page 2024
[62]

Step-1V,

StepFun, “Step-1V,” https://www.stepfun.com/#step1v, 2024, accessed: 2024-12-29

work page 2024
[63]

Image-based table recognition: data, model, and evaluation,

X. Zhong, E. ShafieiBavani, and A. Jimeno Yepes, “Image-based table recognition: data, model, and evaluation,” in Proceedings of European Conference on Computer Vision. Springer, 2020, pp. 564–580

work page 2020
[64]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318

work page 2002
[65]

METEOR: An automatic metric for mt evaluation with improved correlation with human judgments,

S. Banerjee and A. Lavie, “METEOR: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72

work page 2005
[66]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,

B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2016

work page 2016
[68]

Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,

S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7098–7107

work page 2021
[69]

Aster: An attentional scene text recognizer with flexible rectification,

B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2035–2048, 2018

work page 2018
[70]

Master: Multi-aspect non-local network for scene text recognition,

N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, and X. Bai, “Master: Multi-aspect non-local network for scene text recognition,” Pattern Recognition, vol. 117, p. 107980, 2021

work page 2021
[71]

SVTR: scene text recognition with a single visual model,

Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and Y. Jiang, “SVTR: scene text recognition with a single visual model,” in Proceedings of the International Joint Conference on Artificial Intelligence, L. D. Raedt, Ed. ijcai.org, 2022, pp. 884–890. [Online]. Available: https://doi.org/10.24963/ijcai.2022/124

work page doi:10.24963/ijcai.2022/124 2022
[72]

Abcnet: Real-time scene text spotting with adaptive bezier-curve network,

Y. Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, “Abcnet: Real-time scene text spotting with adaptive bezier-curve network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9809–9818

work page 2020
[73]

Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,

Y. Liu, C. Shen, L. Jin, T. He, P. Chen, C. Liu, and H. Chen, “Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8048–8064, 2021

work page 2021
[74]

Text spotting transformers,

X. Zhang, Y. Su, S. Tripathi, and Z. Tu, “Text spotting transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9519–9528

work page 2022
[75]

Total-text: A comprehensive dataset for scene text detection and recognition,

C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in Proceedings of International Conference on Document Analysis and Recognition, vol. 1. IEEE, 2017, pp. 935–942

work page 2017
[76]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

H. Wei, C. Liu, J. Chen, J. Wang, L. Kong, Y. Xu, Z. Ge, L. Zhao, J. Sun, Y. Peng et al., “General ocr theory: Towards ocr-2.0 via a unified end-to-end model,” arXiv preprint arXiv:2409.01704, 2024

work page internal anchor Pith review arXiv 2024
[77]

Icdar 2013 robust reading competition,

D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. de las Heras, “Icdar 2013 robust reading competition,” in Proceedings of International Conference on Document Analysis and Recognition, 2013, pp. 1484–1493

work page 2013
[78]

End-to-end scene text recognition using tree-structured models,

C. Shi, C. Wang, B. Xiao, S. Gao, and J. Hu, “End-to-end scene text recognition using tree-structured models,” Pattern Recognition, vol. 47, pp. 2853–2866, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:30201169

work page 2014
[79]

Scene text recognition using higher order language priors,

A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using higher order language priors,” in British Machine Vision Conference, 2012

work page 2012
[80]

Icdar 2015 competition on robust reading,

D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., “Icdar 2015 competition on robust reading,” in Proceedings of International Conference on Document Analysis and Recognition. IEEE, 2015, pp. 1156–1160

work page 2015
