arxiv: 2505.17163 · v2 · pith:FFV6VXOXnew · submitted 2025-05-22 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

classification 💻 cs.LG cs.AI cs.CL cs.CV
keywords reasoning, text-rich, benchmark, image, tasks, ocr-reasoning, capabilities, evaluation
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

    cs.CV 2026-05 conditional novelty 8.0

    PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

  2. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV 2026-05 unverdicted novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

  3. AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

    cs.CV 2025-06 unverdicted novelty 7.0

    AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.

  4. WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    cs.IR 2025-08 unverdicted novelty 6.0

    WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.

  5. Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

    cs.LG 2026-04 unverdicted novelty 5.0

    Nemotron 3 Nano Omni is an efficient open multimodal model supporting audio, text, images, and video with reported accuracy gains and leading results on document understanding and long audio-video tasks.

  6. Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

    cs.LG 2026-04 unverdicted novelty 4.0

    Nemotron 3 Nano Omni is an efficient open multimodal model supporting audio alongside text, images, and video, with accuracy improvements and lower latency than its predecessor.