CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
Pith reviewed 2026-05-07 16:13 UTC · model grok-4.3
The pith
Even top large multimodal models degrade sharply when tested on real enterprise documents and corner cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CC-OCR V2 is introduced as a comprehensive OCR benchmark for real-world document processing that includes hard and corner cases absent from prior tests. The benchmark spans five major tracks with a total of 7,093 high-difficulty samples. Extensive testing of 14 advanced large multimodal models reveals that even the strongest ones experience substantial performance degradation across tasks and conditions. This demonstrates a notable disconnect between results on existing benchmarks and actual effectiveness in practical applications.
What carries the argument
The CC-OCR V2 benchmark, which tailors tasks to enterprise needs and emphasizes underrepresented difficult samples across five tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering.
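For concreteness, an evaluation harness over these five tracks might represent each benchmark sample roughly like this; the enum values and field names are illustrative assumptions, not the schema of the released toolkit.

```python
from dataclasses import dataclass
from enum import Enum

class Track(Enum):
    """The five OCR-centric tracks named by the benchmark."""
    TEXT_RECOGNITION = "text_recognition"
    DOCUMENT_PARSING = "document_parsing"
    DOCUMENT_GROUNDING = "document_grounding"
    KEY_INFO_EXTRACTION = "key_information_extraction"
    DOCUMENT_QA = "document_question_answering"

@dataclass
class Sample:
    track: Track
    image_path: str       # path to the document image
    prompt: str           # task instruction handed to the LMM
    reference: str        # ground-truth transcription, markup, or answer
    is_corner_case: bool  # flags the hard cases the benchmark emphasizes
```

A harness would iterate samples grouped by `track` and score model outputs against `reference` with a track-appropriate metric.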
If this is right
- Current LMMs are not yet suitable for reliable real-world document processing without additional improvements.
- Existing OCR benchmarks do not adequately test for practical challenges.
- The new dataset enables more accurate assessment of model capabilities in enterprise settings.
- Future model development should target the identified failure modes in hard cases.
- The evaluation toolkit supports standardized testing of future models on real-world scenarios.
Where Pith is reading between the lines
- Researchers may now focus on collecting more diverse real-world document data for training to close the identified gap.
- Similar benchmarking gaps could exist in other multimodal tasks such as image understanding or video analysis.
- Companies using LMMs for document tasks should implement additional safeguards or human oversight until performance improves.
- The benchmark might inspire hybrid systems combining LMMs with traditional OCR tools for better robustness in edge cases.
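The hybrid-system idea in the last point can be sketched as a confidence-gated fallback. The callables `lmm_extract` and `traditional_ocr` are placeholders for whatever engines a deployment actually uses; nothing here comes from the paper itself.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class OcrResult:
    text: str
    confidence: float  # engine-reported confidence in [0, 1]

def hybrid_read(image: bytes,
                lmm_extract: Callable[[bytes], OcrResult],
                traditional_ocr: Callable[[bytes], OcrResult],
                threshold: float = 0.85) -> OcrResult:
    """Prefer the LMM, but fall back to a classical OCR engine
    when the LMM's confidence is low (the hard/corner cases)."""
    primary = lmm_extract(image)
    if primary.confidence >= threshold:
        return primary
    fallback = traditional_ocr(image)
    # Keep whichever engine is more confident on this hard case;
    # a production system might instead route it to human review.
    return max(primary, fallback, key=lambda r: r.confidence)
```

The `threshold` value is arbitrary and would need calibration on held-out documents.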
Load-bearing premise
The 7,093 samples across five tracks sufficiently capture the critical hard and corner cases in real-world enterprise document processing.
What would settle it
If the same 14 models achieve accuracy on CC-OCR V2 comparable to their scores on prior benchmarks, or if new samples from actual enterprise failures show different error patterns, the claimed performance gap would not hold.
Original abstract
Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CC-OCR V2, a benchmark with 7,093 high-difficulty samples across five tracks (text recognition, document parsing, document grounding, key information extraction, and document question answering) aimed at real-world enterprise document processing. It evaluates 14 LMMs and reports substantial performance degradation relative to prior benchmarks, concluding that current models fall short of practical requirements and that a significant gap exists between existing benchmark performance and real-world effectiveness. The dataset and evaluation toolkit are released publicly.
Significance. If the samples are shown to be drawn from or matched to actual enterprise distributions and to target the specific failure modes that matter for deployment, the benchmark would be a useful contribution for exposing limitations in LMM document literacy and motivating more robust models. The public release aids reproducibility.
Major comments (1)
- [§3] §3 (Benchmark Construction): The curation process for the 7,093 samples is described at a high level as incorporating 'hard and corner cases that are critical yet underrepresented,' but provides no quantitative validation (e.g., comparison of error-type distributions to production logs, statistical matching to enterprise corpora, or expert-rated difficulty scores). This detail is load-bearing for the central claim that observed degradation on the 14 models reveals a genuine real-world gap rather than simply a harder synthetic test set.
Minor comments (2)
- [Abstract] Abstract and §4: The exact per-track metrics (e.g., edit distance, F1, or accuracy definitions) and annotation protocol are not fully specified, making it difficult to interpret the reported performance numbers and degradation claims.
- [Results] Table 1 or equivalent results section: Clarify baseline comparisons to prior OCR benchmarks to make the 'significant gap' claim more precise.
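Since the per-track metrics are left unspecified, here is a minimal sketch of two plausible choices: normalized edit distance for recognition-style tracks and exact-match field F1 for key information extraction. These definitions are assumptions for illustration, not the ones CC-OCR V2 actually uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means a perfect match."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))

def field_f1(pred: dict, ref: dict) -> float:
    """Exact-match F1 over extracted key-value fields."""
    tp = sum(1 for k, v in pred.items() if ref.get(k) == v)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Whether a benchmark normalizes by reference length, prediction length, or their maximum changes the reported numbers, which is why the referee asks for exact definitions.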
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The feedback on benchmark construction highlights an important aspect of substantiating our claims about real-world applicability. We address the major comment below and outline planned revisions to strengthen the manuscript.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): The curation process for the 7,093 samples is described at a high level as incorporating 'hard and corner cases that are critical yet underrepresented,' but provides no quantitative validation (e.g., comparison of error-type distributions to production logs, statistical matching to enterprise corpora, or expert-rated difficulty scores). This detail is load-bearing for the central claim that observed degradation on the 14 models reveals a genuine real-world gap rather than simply a harder synthetic test set.
Authors: We agree that additional substantiation of the curation process would strengthen the central claim. The 7,093 samples were assembled by selecting documents from diverse real-world enterprise sources (e.g., invoices, contracts, forms, and reports) and prioritizing instances exhibiting documented failure modes of prior OCR systems, such as dense tables, handwritten annotations, degraded scans, and domain-specific terminology. Selection was guided by expert review from practitioners in document processing. However, direct quantitative matching to proprietary production logs or statistical distribution comparisons is not feasible due to data access restrictions. We will revise §3 to provide a more detailed breakdown of the difficulty criteria used, including a taxonomy of included corner cases with examples and references to common real-world challenges reported in the literature. We will also include a new subsection discussing how the observed performance gaps align with known deployment issues rather than arbitrary hardness. This constitutes a partial revision, as full quantitative validation against external corpora would require additional data collection beyond the current scope.
Revision status: partial
Circularity Check
No circularity: purely empirical benchmark evaluation
Full rationale
The paper introduces CC-OCR V2 as a new dataset of 7,093 samples across five tracks and reports direct empirical results from evaluating 14 external LMMs on it. No mathematical derivations, fitted parameters, self-citations, or ansatzes are present in the abstract or described methodology. The central claim of performance degradation is a straightforward measurement against the released dataset and models, with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The selected 7,093 samples and five tracks accurately capture underrepresented hard and corner cases in real-world document processing.