pith. machine review for the scientific record

arxiv: 2605.03903 · v1 · submitted 2026-05-05 · 💻 cs.CL

Recognition: unknown

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords OCR · Large Multimodal Models · Document Processing · Benchmark · Real-world Applications · Text Recognition · Key Information Extraction · Document Question Answering

The pith

Even top large multimodal models degrade sharply when tested on real enterprise documents and corner cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CC-OCR V2 to test large multimodal models on practical document OCR tasks, including difficult real-world examples. It evaluates 14 models across text recognition, document parsing, grounding, key information extraction, and question answering using 7,093 challenging samples. The models perform much worse than on standard benchmarks, indicating that existing tests do not capture the difficulties of actual applications. This matters because it shows where these models fall short for enterprise document processing. The benchmark focuses on hard cases that were underrepresented in earlier benchmarks.

Core claim

CC-OCR V2 is introduced as a comprehensive OCR benchmark for real-world document processing that includes hard and corner cases absent from prior tests. The benchmark spans five major tracks with a total of 7,093 high-difficulty samples. Extensive testing of 14 advanced large multimodal models reveals that even the strongest ones experience substantial performance degradation across tasks and conditions. This demonstrates a notable disconnect between results on existing benchmarks and actual effectiveness in practical applications.

What carries the argument

The CC-OCR V2 benchmark, which tailors tasks to enterprise needs and emphasizes underrepresented difficult samples across five tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering.
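
To make that structure concrete, here is a minimal sketch of the harness shape a five-track benchmark implies: iterate over tracks, score each sample with a track-appropriate metric, and aggregate per track. The track names come from the paper; the `load_samples`, `model.generate`, and `score` interfaces are hypothetical stand-ins, not the API of the released toolkit.

```python
# Sketch of a five-track evaluation loop. TRACKS follows the paper's
# track list; the loader, model interface, and metric callback are
# hypothetical stand-ins, not the released toolkit's API.
from statistics import mean

TRACKS = [
    "text_recognition",
    "document_parsing",
    "document_grounding",
    "key_information_extraction",
    "document_question_answering",
]

def evaluate(model, load_samples, score):
    """Return {track: mean score} for one model.

    load_samples(track) yields (image, prompt, reference) triples;
    score(track, prediction, reference) returns a float in [0, 1].
    """
    report = {}
    for track in TRACKS:
        scores = [
            score(track, model.generate(image, prompt), reference)
            for image, prompt, reference in load_samples(track)
        ]
        # Guard against an empty track so mean() never sees an empty list.
        report[track] = mean(scores) if scores else float("nan")
    return report
```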

If this is right

  • Current LMMs are not yet suitable for reliable real-world document processing without additional improvements.
  • Existing OCR benchmarks do not adequately test for practical challenges.
  • The new dataset enables more accurate assessment of model capabilities in enterprise settings.
  • Future model development should target the identified failure modes in hard cases.
  • The evaluation toolkit supports standardized testing of future models on real-world scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Researchers may now focus on collecting more diverse real-world document data for training to close the identified gap.
  • Similar benchmarking gaps could exist in other multimodal tasks such as image understanding or video analysis.
  • Companies using LMMs for document tasks should implement additional safeguards or human oversight until performance improves.
  • The benchmark might inspire hybrid systems combining LMMs with traditional OCR tools for better robustness in edge cases.

Load-bearing premise

The 7,093 samples across five tracks sufficiently capture the critical hard and corner cases in real-world enterprise document processing.

What would settle it

If the same 14 models achieve accuracy on CC-OCR V2 comparable to their scores on prior benchmarks, or if new samples from actual enterprise failures show different error patterns, the claimed performance gap would not hold.

Figures

Figures reproduced from arXiv: 2605.03903 by Chunyi Peng, Dayiheng Liu, Jianqiang Wan, Junhao Ji, Jun Tang, Qing Liu, Shuai Bai, Ze Xu, Zhenghao Liu, Zhibo Yang, Zhipeng Xu, Zubao Qin, Zulong Chen.

Figure 1: Overview of CC-OCR V2. CC-OCR V2 is a comprehensive and challenging benchmark for evaluating the document literacy of LMMs in real-world document processing. It covers five major OCR-centric tracks and 74 scenarios, enabling fine-grained evaluation of document literacy in LMMs. Building upon CC-OCR (Yang et al., 2025b), CC-OCR V2 systematically expands task coverage to better reflect practical document processing …
original abstract

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CC-OCR V2, a benchmark with 7,093 high-difficulty samples across five tracks (text recognition, document parsing, document grounding, key information extraction, and document question answering) aimed at real-world enterprise document processing. It evaluates 14 LMMs and reports substantial performance degradation relative to prior benchmarks, concluding that current models fall short of practical requirements and that a significant gap exists between existing benchmark performance and real-world effectiveness. The dataset and evaluation toolkit are released publicly.

Significance. If the samples are shown to be drawn from or matched to actual enterprise distributions and to target the specific failure modes that matter for deployment, the benchmark would be a useful contribution for exposing limitations in LMM document literacy and motivating more robust models. The public release aids reproducibility.

major comments (1)
  1. [§3] Benchmark Construction: The curation process for the 7,093 samples is described at a high level as incorporating 'hard and corner cases that are critical yet underrepresented,' but no quantitative validation is provided (e.g., comparison of error-type distributions to production logs, statistical matching to enterprise corpora, or expert-rated difficulty scores). This detail is load-bearing for the central claim that the observed degradation across the 14 models reveals a genuine real-world gap rather than simply a harder synthetic test set.
minor comments (2)
  1. [Abstract, §4] The exact per-track metrics (e.g., edit distance, F1, or accuracy definitions) and the annotation protocol are not fully specified, making it difficult to interpret the reported performance numbers and degradation claims; common metric choices are sketched after this list for concreteness.
  2. [Results] In Table 1 or the equivalent results section, clarify the baseline comparisons to prior OCR benchmarks so that the 'significant gap' claim is more precise.
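
For concreteness, the sketch below shows two metric definitions commonly used in OCR benchmarking: normalized edit distance for text recognition and parsing outputs, and micro F1 over extracted key-value fields for key information extraction. These are illustrative assumptions, not CC-OCR V2's documented metrics, which is precisely what the first minor comment asks the authors to pin down.

```python
# Two metrics commonly used in OCR benchmarking, shown for illustration;
# the paper's exact per-track definitions are unspecified, so treat these
# as assumed conventions rather than CC-OCR V2's own.

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (0 = exact match)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))  # DP row for the empty prefix of pred
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / max(m, n)

def field_f1(pred: dict, ref: dict) -> float:
    """Micro F1 over exact (key, value) pairs, e.g. for key information extraction."""
    pred_items, ref_items = set(pred.items()), set(ref.items())
    tp = len(pred_items & ref_items)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_items)
    recall = tp / len(ref_items)
    return 2 * precision * recall / (precision + recall)
```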

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback on benchmark construction highlights an important aspect of substantiating our claims about real-world applicability. We address the major comment below and outline planned revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3] Benchmark Construction: The curation process for the 7,093 samples is described at a high level as incorporating 'hard and corner cases that are critical yet underrepresented,' but no quantitative validation is provided (e.g., comparison of error-type distributions to production logs, statistical matching to enterprise corpora, or expert-rated difficulty scores). This detail is load-bearing for the central claim that the observed degradation across the 14 models reveals a genuine real-world gap rather than simply a harder synthetic test set.

    Authors: We agree that additional substantiation of the curation process would strengthen the central claim. The 7,093 samples were assembled by selecting documents from diverse real-world enterprise sources (e.g., invoices, contracts, forms, and reports) and prioritizing instances exhibiting documented failure modes of prior OCR systems, such as dense tables, handwritten annotations, degraded scans, and domain-specific terminology. Selection was guided by expert review from practitioners in document processing. However, direct quantitative matching to proprietary production logs or statistical distribution comparisons is not feasible due to data access restrictions. We will revise §3 to provide a more detailed breakdown of the difficulty criteria used, including a taxonomy of included corner cases with examples and references to common real-world challenges reported in the literature. We will also include a new subsection discussing how the observed performance gaps align with known deployment issues rather than arbitrary hardness. This constitutes a partial revision, as full quantitative validation against external corpora would require additional data collection beyond the current scope.

    revision: partial
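
A minimal sketch of what the promised taxonomy could look like in data terms, assuming each sample is tagged with the corner-case categories named in the rebuttal (dense tables, handwritten annotations, degraded scans, domain-specific terminology). The `Sample` structure and scoring interface are hypothetical; only the tag names come from the text above.

```python
# Hedged sketch of a corner-case taxonomy: each sample carries zero or
# more difficulty tags so degradation can be reported per failure mode.
# Tag names follow the rebuttal's examples; everything else is assumed.
from dataclasses import dataclass, field

CORNER_CASES = {
    "dense_table",
    "handwritten_annotation",
    "degraded_scan",
    "domain_terminology",
}

@dataclass
class Sample:
    sample_id: str
    track: str
    tags: set[str] = field(default_factory=set)

def degradation_by_tag(samples, per_sample_score):
    """Mean score per corner-case tag, given {sample_id: score}."""
    buckets: dict[str, list[float]] = {tag: [] for tag in CORNER_CASES}
    for s in samples:
        for tag in s.tags & CORNER_CASES:
            buckets[tag].append(per_sample_score[s.sample_id])
    # Only report tags that actually occur, avoiding division by zero.
    return {tag: sum(v) / len(v) for tag, v in buckets.items() if v}
```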

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper introduces CC-OCR V2 as a new dataset of 7,093 samples across five tracks and reports direct empirical results from evaluating 14 external LMMs on it. No mathematical derivations, fitted parameters, self-citations, or ansatzes are present in the abstract or described methodology. The central claim of performance degradation is a straightforward measurement against the released dataset and models, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the curated samples reflect practical enterprise needs; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The selected 7,093 samples and five tracks accurately capture underrepresented hard and corner cases in real-world document processing.
    This assumption underpins the claim that existing benchmarks are misaligned with practical applications.

pith-pipeline@v0.9.0 · 5549 in / 1068 out tokens · 46571 ms · 2026-05-07T16:13:11.463367+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 28 canonical work pages · 7 internal anchors

  1. [1] OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 2024.
  2. [2] OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024.
  3. [3] Step3-vl-10b technical report. arXiv preprint arXiv:2601.09668.
  4. [4] Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611, 2026.
  5. [5] InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265.
  6. [6] MiMo: Unlocking the reasoning potential of language model -- from pretraining to posttraining. arXiv preprint arXiv:2505.07608, 2025.
  7. [7] MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025.
  8. [8] Readoc: A unified benchmark for realistic document structured extraction. Findings of the Association for Computational Linguistics: ACL 2025.
  9. [9] A survey of OCR evaluation methods and metrics and the invisibility of historical documents. arXiv preprint arXiv:2603.25761.
  10. [10] OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  11. [11] Real5-OmniDocBench: A full-scale physical reconstruction benchmark for robust document parsing in the wild. arXiv preprint arXiv:2603.04205.
  12. [12] Document AI: Benchmarks, models and applications. arXiv preprint arXiv:2111.08609.
  13. [13] olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025.
  14. [14] CC-OCR: A comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  15. [15] A survey of deep learning approaches for OCR and document understanding. arXiv preprint arXiv:2011.13534.
  16. [16] Document intelligence in the era of large language models: A survey. arXiv preprint arXiv:2510.13366, 2025.
  17. [17] Efficient OCR for building a diverse digital history. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  18. [18] DUBLIN: Visual document understanding by language-image network. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track.
  19. [19] Fetch-A-Set: A large-scale OCR-free benchmark for historical document retrieval. International Workshop on Document Analysis Systems, 2024.
  20. [20] UNIKIE-BENCH: Benchmarking large multimodal models for key information extraction in visual documents. arXiv preprint arXiv:2602.07038.
  21. [21] Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
  22. [22] Information extraction from visually rich documents using LLM-based organization of documents into independent textual segments. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  23. [23] Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. arXiv preprint arXiv:2410.21169.
  24. [24] Document understanding dataset and evaluation (DUDE). Proceedings of the IEEE/CVF International Conference on Computer Vision.
  25. [25] SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  26. [26] LMMs-Eval: Reality check on the evaluation of large multimodal models. Findings of the Association for Computational Linguistics: NAACL 2025.
  27. [27] Towards real-world document parsing via realistic scene synthesis and document-aware training. arXiv preprint arXiv:2603.23885.
  28. [28] R-Bench: Are your large multimodal model robust to real-world corruptions? IEEE Journal of Selected Topics in Signal Processing.
  29. [29] OCRTurk: A comprehensive OCR benchmark for Turkish. Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026).
  30. [30] DocPTBench: Benchmarking end-to-end photographed document parsing and translation. arXiv preprint arXiv:2511.18434.
  31. [31] Deep learning based visually rich document content understanding: A survey. Artificial Intelligence Review.
  32. [32] FCMBench: A comprehensive financial credit multimodal benchmark for real-world applications. arXiv preprint arXiv:2601.00150.
  33. [33] A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv preprint arXiv:2408.01319.
  34. [34] DocLLM: A layout-aware generative language model for multimodal document understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  35. [35] MMBench: Is your multi-modal model an all-around player? European Conference on Computer Vision, 2024.
  36. [36] REVISE: A framework for revising OCRed text in practical information systems with data contamination strategy. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track).
  37. [37] OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  38. [38] TRIE: End-to-end text reading and information extraction for document understanding. Proceedings of the 28th ACM International Conference on Multimedia.
  39. [39] HunyuanOCR technical report. arXiv preprint arXiv:2511.19575, 2025.
  40. [40] LayoutLMv3: Pre-training for document AI with unified text and image masking. Proceedings of the 30th ACM International Conference on Multimedia.
  41. [41] Pix2Struct: Screenshot parsing as pretraining for visual language understanding. International Conference on Machine Learning, 2023.
  42. [42] Unifying vision, text, and layout for universal document processing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  43. [43] Multimodal large language models for text-rich image understanding: A comprehensive review. Findings of the Association for Computational Linguistics: ACL 2025.
  44. [44] TextMonkey: An OCR-free large multimodal model for understanding document. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  45. [45] UReader: Universal OCR-free visually-situated language understanding with multimodal large language model. Findings of the Association for Computational Linguistics: EMNLP 2023.
  46. [46] DocKylin: A large multimodal model for visual document understanding with efficient visual slimming. Proceedings of the AAAI Conference on Artificial Intelligence.
  47. [47] An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. European Conference on Computer Vision, 2024.
  48. [48] CROP: Contextual region-oriented visual token pruning. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  49. [49] TopV: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. Proceedings of the Computer Vision and Pattern Recognition Conference.
  50. [50] VisionZip: Longer is better but not necessary in vision language models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  51. [51] Hierarchical visual feature aggregation for OCR-free document understanding. Advances in Neural Information Processing Systems.
  52. [52] LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. European Conference on Computer Vision, 2024.
  53. [53] HiRes-LLaVA: Restoring fragmentation input in high-resolution large vision-language models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  54. [54] Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. Proceedings of the AAAI Conference on Artificial Intelligence.
  55. [55] ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding. Findings of the Association for Computational Linguistics: EMNLP 2022.
  56. [56] Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419.
  57. [57] Lang2Act: Fine-grained visual reasoning through self-emergent linguistic toolchains. arXiv preprint arXiv:2602.13235.
  58. [58] ReAlign: Optimizing the visual document retriever with reasoning-guided fine-grained alignment. arXiv preprint arXiv:2604.07419.
  59. [59] Qianfan-OCR: A unified end-to-end model for document intelligence. arXiv preprint arXiv:2603.13398.
  60. [60] Doc-CoB: Enhancing multi-modal document understanding with visual chain-of-boxes reasoning. arXiv preprint arXiv:2505.18603.
  61. [61] DocR1: Evidence page-guided GRPO for multi-page document understanding. Proceedings of the AAAI Conference on Artificial Intelligence.
  62. [62] VRAG-RL: Empower vision-perception-based RAG for visually rich information understanding via iterative reasoning with reinforcement learning. arXiv preprint arXiv:2505.22019, 2025.
  63. [63] Pixel Reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  64. [64] A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632.
  65. [65] MME-Survey: A comprehensive survey on evaluation of multimodal LLMs. arXiv preprint arXiv:2411.15296.