When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation

Change Jia; Jingang Huang; Linglin Zhang; Lin Sun; Wang Dexian; Xiangzheng Zhang; Zhengwei Cheng

arxiv: 2605.00911 · v1 · submitted 2026-04-29 · 💻 cs.CV

Lin Sun, Wang Dexian, Jingang Huang, Linglin Zhang, Change Jia, Zhengwei Cheng, Xiangzheng Zhang

Pith reviewed 2026-05-09 20:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords OCR robustness · Retrieval-augmented generation · Document benchmark · Industrial documents · RAG pipeline · Structural errors · Semantic fidelity

The pith

High OCR accuracy does not ensure strong retrieval-augmented generation performance on realistic industrial documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard character-level OCR metrics such as word and character error rates fail to capture how well extracted text supports downstream retrieval and generation in RAG systems. It presents a new benchmark covering eleven document categories with extreme layouts, watermarks, tables, formulas, and non-standard reading orders. When recent OCR models are run through a controlled OCR-first RAG pipeline, retrieval failures rise sharply on these documents even though conventional benchmark scores remain high. The gaps arise from structural and semantic distortions rather than simple character mistakes and appear consistently across document types and pipeline variations.

Core claim

High OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. The mismatch between conventional OCR scores and RAG effectiveness is category-dependent, occurs on both retrieval and generation sides, and holds across representative OCR-first pipeline choices.
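
A minimal sketch of how a low character error rate can coexist with a wrong answer. The table row, the `max_pressure` helper, and the values are hypothetical illustrations, not data from the benchmark; CER is computed here as plain Levenshtein distance over reference length, which is the usual definition but may differ in detail from the paper's scoring.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def max_pressure(text: str):
    """Hypothetical downstream lookup: the value a question would ask about."""
    match = re.search(r"Max pressure: (\d+) kPa", text)
    return int(match.group(1)) if match else None


# Hypothetical table row; the OCR output attaches each value to the wrong label.
reference = "Max pressure: 150 kPa | Min pressure: 15 kPa"
ocr_output = "Max pressure: 15 kPa | Min pressure: 150 kPa"

print(f"CER = {cer(reference, ocr_output):.3f}")
print("reference answer:", max_pressure(reference))
print("OCR answer:      ", max_pressure(ocr_output))
```

Two character edits separate the strings, so the CER stays under 5%, yet the value attached to "Max pressure" changes from 150 to 15. Character-level scoring treats this as a near-perfect transcription.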

What carries the argument

An OCR-first RAG evaluation pipeline that measures end-to-end retrieval and answer quality on eleven industrial document categories instead of isolated character error rates.
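
The idea of scoring OCR by retrieval outcome rather than by character errors can be sketched as follows. Everything here is an assumption for illustration: the word-window chunker, the token-overlap retriever, the two-column "pump" page, and the query are stand-ins, not the paper's pipeline components or data.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text: str, size: int = 9) -> list[str]:
    """Fixed-size word windows, a deliberately naive chunker."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by token overlap with the query (stand-in for a real retriever)."""
    q = Counter(tokenize(query))

    def overlap(c: str) -> int:
        return sum((Counter(tokenize(c)) & q).values())

    return sorted(chunks, key=overlap, reverse=True)[:k]


def retrieval_hit(query: str, answer: str, text: str, k: int = 1) -> bool:
    """True if any top-k chunk contains the answer string."""
    return any(answer in c for c in retrieve(query, chunk(text), k))


# Hypothetical two-column page. The OCR text contains exactly the same words,
# but the columns were read row by row, so the two sentences are interleaved.
ground_truth = ("Pump A rated flow is 40 liters per minute. "
                "Pump B rated flow is 90 liters per minute.")
ocr_text = ("Pump A rated flow Pump B rated flow "
            "is 40 liters per is 90 liters per minute. minute.")

query, answer = "What is the rated flow of Pump B?", "90"
print("ground truth hit:", retrieval_hit(query, answer, ground_truth))
print("OCR text hit:    ", retrieval_hit(query, answer, ocr_text))
```

In this toy case no word is misrecognized, yet the top-ranked OCR chunk no longer contains the answer, because the reading-order error separated the label from its value.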

If this is right

Structural and semantic errors in OCR output produce retrieval failures even when character accuracy metrics look good.
The performance gap between conventional OCR benchmarks and RAG effectiveness varies by document category.
Both retrieval-side and generation-side failures contribute to the observed degradation.
The mismatch remains stable across different representative OCR-first pipeline configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG developers may need to optimize OCR components for layout preservation and semantic fidelity rather than character accuracy alone.
Post-processing steps that repair structure and reading order could close part of the gap without changing the underlying OCR engine.
Future OCR training objectives could incorporate retrieval or question-answering signals as additional supervision.
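
As one illustration of the post-processing idea above, a reading-order repair for a two-column page might look like the sketch below. The `Box` type, the coordinates, and the half-page split are assumptions made for this example; real layouts would need proper column detection, and nothing here is taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float   # left edge of the text line
    y: float   # top edge of the text line
    text: str


def reorder_two_columns(boxes: list[Box], page_width: float) -> str:
    """Read the left column top to bottom, then the right column."""
    mid = page_width / 2
    left = sorted((b for b in boxes if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in boxes if b.x >= mid), key=lambda b: b.y)
    return " ".join(b.text for b in left + right)


# Hypothetical OCR line boxes, listed in the row-by-row order an engine might emit.
boxes = [
    Box(50, 100, "Pump A rated flow"), Box(350, 100, "Pump B rated flow"),
    Box(50, 120, "is 40 liters per"),  Box(350, 120, "is 90 liters per"),
    Box(50, 140, "minute."),           Box(350, 140, "minute."),
]
print(reorder_two_columns(boxes, page_width=600))
```

The repair leaves the recognized characters untouched and only changes their order, which is why it could in principle help retrieval without retraining the OCR engine.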

Load-bearing premise

That the eleven chosen document categories and the controlled OCR-first RAG pipeline sufficiently represent real industrial conditions, and that the observed performance gaps are driven primarily by OCR rather than by other pipeline components.

What would settle it

If an OCR model selected or fine-tuned to minimize the new benchmark's retrieval failures shows no improvement in RAG accuracy on a fresh set of industrial documents compared with models chosen solely by WER or CER, the claimed mismatch would be falsified.

Figures

Figures reproduced from arXiv: 2605.00911 by Change Jia, Jingang Huang, Linglin Zhang, Lin Sun, Wang Dexian, Xiangzheng Zhang, Zhengwei Cheng.

**Figure 2.** Benchmarking OCR Robustness for RAG.

**Figure 3.** OCR accuracy versus RAG accuracy across document types. Four regimes emerge: OCR Reliable (high-high), LLM Compensates (low-high), Both Weak (low-low), and OCR Blind Spot (high-low). VisualStyle exemplifies the blind spot: 82.9% OCR accuracy yields only 53.0% RAG accuracy.

**Figure 4.** RAG accuracy (%) across OCR models and document challenges. Ground-Truth represents perfect OCR. Color indicates performance: green (high) to red (low).

**Figure 5.** InduOCRBench Document Domain Distribution. MultiFont: individual documents share a consistent font style, while different documents adopt different font styles, such as Songti, Fangsong, and others. CrosspageTable: documents containing tables where a single logical table spans two or more pages, requiring structural merging to restore data continuity.

**Figure 6.** Radar chart comparison of recall (left) and accuracy (right) performance across 11 document challenge…
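
The four regimes named in the Figure 3 caption amount to a two-by-two split on OCR accuracy and RAG accuracy. A small sketch of that classification follows; the 70% cut-off is an assumption chosen for illustration, since the captions shown here do not state where the paper draws the line between "high" and "low".

```python
def regime(ocr_acc: float, rag_acc: float, threshold: float = 70.0) -> str:
    """Map an (OCR accuracy, RAG accuracy) pair in percent to a named regime.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    ocr_high = ocr_acc >= threshold
    rag_high = rag_acc >= threshold
    if ocr_high and rag_high:
        return "OCR Reliable"
    if not ocr_high and rag_high:
        return "LLM Compensates"
    if not ocr_high and not rag_high:
        return "Both Weak"
    return "OCR Blind Spot"


# VisualStyle figures quoted in the Figure 3 caption.
print(regime(82.9, 53.0))
```

Under this assumed threshold the VisualStyle pair lands in the blind-spot quadrant, matching the caption's description.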

Original abstract

Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows OCR accuracy metrics fail to predict RAG performance on complex documents and releases a benchmark for it, but lacks a ground-truth text baseline to isolate OCR as the cause.

The core observation here is that strong WER and CER scores on OCR do not reliably predict how well a RAG pipeline will retrieve and generate from the same pages when the documents have extreme layouts, tables, formulas, or non-standard order. The authors built a benchmark across 11 such categories, ran recent OCR models through a fixed OCR-first RAG setup, and reported clear drops in retrieval and generation quality that vary by document type. They also note the mismatch holds across a few pipeline choices and release the data publicly.

That focus on downstream RAG impact rather than isolated character accuracy is the useful part, and the category-specific results could help practitioners see where current OCR tools still break real workflows. Releasing the benchmark is straightforward credit.

The main limitation is the absence of a clean control: they do not feed the identical retriever, embedder, and generator the ground-truth text transcriptions of those same pages. Without that delta, the observed failures could stem from the inherent difficulty of the document structures themselves rather than from OCR errors specifically. The stability across pipeline variants does not substitute for the missing reference condition. Details on exact retrieval metrics, statistical tests, and error tracing are also thin in what is shown, which makes it harder to gauge effect sizes.

This work is aimed at teams building or evaluating RAG systems that ingest scanned or visually complex industrial documents. A practitioner or benchmark designer would find the dataset and the mismatch examples worth looking at. It is worth sending to peer review because the practical gap it targets is real and the released resource has potential value, even if the experimental design needs the ground-truth baseline added to strengthen the central claim.

Referee Report

1 major / 2 minor

Summary. The paper introduces InduOCRBench, a benchmark for OCR robustness in industrial RAG systems spanning 11 challenging document categories (extreme layouts, tables, formulas, historical documents, etc.). It evaluates SOTA OCR models in a controlled OCR-first RAG pipeline and claims that high conventional OCR accuracy (low WER/CER) does not guarantee strong downstream RAG performance, as structural and semantic errors lead to retrieval and generation failures in a category-dependent manner that is stable across pipeline variants.

Significance. If the central findings hold, the work is significant for shifting OCR evaluation from isolated character-level metrics toward task-specific robustness in RAG pipelines, which is relevant for industrial document processing. The public benchmark release and analysis of both retrieval-side and generation-side failure modes are positive contributions that could guide future OCR and RAG research.

major comments (1)

[Experiments] The evaluation lacks a ground-truth text baseline: the RAG pipeline (retriever, chunker, embedder, generator) is never run on perfect transcriptions of the same pages. Without this delta, the observed performance gaps cannot be confidently attributed to OCR errors rather than the inherent properties of the 11 document categories (e.g., non-standard reading order, tables, formulas). This directly undermines the central claim that structural/semantic OCR errors cause substantial RAG failures even when WER/CER is low. (Evaluation / Experiments section describing the OCR-first pipeline and results.)

minor comments (2)

[Benchmark Description] A table explicitly listing the 11 categories with representative examples and statistics (e.g., page counts, average complexity) would improve clarity and reproducibility.
[Abstract] The abstract states the benchmark is publicly available; confirm that the GitHub repository includes the exact document images, ground-truth annotations if any, and pipeline code used for the reported results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the absence of a ground-truth text baseline limits the strength of our attribution of RAG failures to OCR errors, and we will revise the manuscript to address this.

Referee: The evaluation lacks a ground-truth text baseline: the RAG pipeline (retriever, chunker, embedder, generator) is never run on perfect transcriptions of the same pages. Without this delta, the observed performance gaps cannot be confidently attributed to OCR errors rather than the inherent properties of the 11 document categories (e.g., non-standard reading order, tables, formulas). This directly undermines the central claim that structural/semantic OCR errors cause substantial RAG failures even when WER/CER is low. (Evaluation / Experiments section describing the OCR-first pipeline and results.)

Authors: We agree that a ground-truth text baseline is necessary to isolate the contribution of OCR errors from document-inherent difficulties. In the revised manuscript we will add experiments that run the identical RAG pipeline on perfect transcriptions of the same pages. This will allow direct computation of performance deltas attributable to OCR imperfections. We will report these results in the Experiments section, update the relevant tables and figures, and revise the discussion to clarify how the new baseline supports our claims about structural and semantic errors. The addition will not change the core experimental design but will strengthen the causal attribution.

revision: yes
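
The delta computation discussed in this exchange can be sketched as below. The category names and accuracy numbers are invented placeholders, not results from the paper, and the two-way split into an OCR-attributable delta and a residual is one simple reading of the referee's request rather than the authors' stated method.

```python
def attribute_gap(gt_acc: dict[str, float], ocr_acc: dict[str, float]) -> dict[str, dict[str, float]]:
    """Split each category's shortfall (in percentage points) into two parts.

    ocr_delta: accuracy lost when OCR text replaces perfect transcriptions.
    residual:  shortfall that remains even with perfect text, i.e. document
               difficulty plus the rest of the pipeline.
    """
    report = {}
    for category, gt in gt_acc.items():
        ocr = ocr_acc[category]
        report[category] = {
            "ocr_delta": round(gt - ocr, 1),
            "residual": round(100.0 - gt, 1),
        }
    return report


# Placeholder numbers: both categories score 60% with OCR text,
# but they differ in how well the pipeline does on perfect text.
gt_acc = {"category_a": 90.0, "category_b": 65.0}
ocr_acc = {"category_a": 60.0, "category_b": 60.0}

for category, parts in attribute_gap(gt_acc, ocr_acc).items():
    print(category, parts)
```

Two categories with the same OCR-pipeline accuracy can have very different attributions: in the placeholder data, most of the shortfall in `category_a` traces to OCR, while most of it in `category_b` would remain even with perfect text. This is the distinction the ground-truth baseline is meant to expose.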

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations, fitted predictions, or self-referential steps

The paper is a pure empirical benchmark evaluation: it defines 11 document categories, runs SOTA OCR models through a fixed RAG pipeline, and measures retrieval/generation metrics. No equations, no parameters fitted to subsets and then called predictions, no self-citation chains justifying uniqueness or ansatzes, and no renaming of known results as new derivations. The central claim (high OCR accuracy does not guarantee RAG performance) is established by direct side-by-side measurements on the released benchmark; the evaluation chain is self-contained against external data and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that the selected document types capture industrial OCR challenges and that the pipeline isolates OCR effects from other variables.

axioms (1)

Domain assumption: Industrial RAG systems depend on OCR to process visual documents.
Directly stated in the opening of the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1112 out tokens · 51464 ms · 2026-05-09T20:00:29.576240+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System. 2021.
[2] PaddleOCR 3.0 Technical Report. 2025.
[3] MinerU: An Open-Source Solution for Precise Document Content Extraction. 2024.
[4] Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation. 2023.
[5] Notes on Applicability of GPT-4 to Document Understanding. 2024.
[6] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. 2024.
[7] DeepSeek-OCR: Contexts Optical Compression. 2025.
[8] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model. 2025.
[9] Liu, Yuliang; Li, Zhang; Huang, Mingxin; Yang, Biao; Yu, Wenwen; Li, Chunyuan; Yin, Xu-Cheng; Liu, Cheng-Lin; Jin, Lianwen; Bai, Xiang. OCRBench: on the hidden mystery of OCR in large multimodal models. Science China Information Sciences. doi:10.1007/s11432-024-4235-6
[10] OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. 2025.
[11] Anand, Avinash; Jaiswal, Raj; Bhuyan, Pijush; Gupta, Mohit; Bangar, Siddhesh; Imam, Md. Modassir; Shah, Rajiv Ratn; Satoh, Shin’ichi. TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content. doi:10.1145/3606040.3617444
[12] Deep learning for table detection and structure recognition: A survey. 2022.
[13] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.
[14] Ragas: Automated Evaluation of Retrieval Augmented Generation. 2025.
[15] ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. 2024.
[16] OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation. 2025.
[17] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing. 2025.
[18] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model. 2025.
[19] HunyuanOCR Technical Report. 2025.
[20] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025.
[21] Qwen3-VL Technical Report. ArXiv.
[22] GPT-4o System Card. 2024.
[23] gpt-oss-120b & gpt-oss-20b Model Card. 2025.
[24] Jiajie Jin; Yutao Zhu; Zhicheng Dou; Guanting Dong; Xinyu Yang; Chenghao Zhang; Tong Zhao; Zhao Yang; Ji… FlashRAG:… Companion Proceedings of the… 2025.
[25] Making Large Language Models A Better Foundation For Dense Retrieval. 2023.
[26] BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 2024.
