When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation
Pith reviewed 2026-05-09 20:00 UTC · model grok-4.3
The pith
High OCR accuracy does not ensure strong retrieval-augmented generation performance on realistic industrial documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. The mismatch between conventional OCR scores and RAG effectiveness is category-dependent, occurs on both retrieval and generation sides, and holds across representative OCR-first pipeline choices.
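One way to see how a low CER can coexist with a retrieval miss is a toy example. The sketch below is not the paper's pipeline: the sentence, the part number, and the `keyword_hit` lexical matcher are all hypothetical and chosen only to illustrate the claim. A single confused character in an identifier barely moves CER, yet a query on that identifier no longer matches the chunk.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def keyword_hit(query: str, chunk: str) -> bool:
    """Toy lexical retriever (an assumption): every query token must appear in the chunk."""
    chunk_tokens = set(chunk.lower().split())
    return all(tok in chunk_tokens for tok in query.lower().split())


# Hypothetical ground truth and OCR output; the OCR confuses "1" with "I".
reference = "Valve XK-4810 must be torqued to 35 Nm before pressure testing."
ocr_output = "Valve XK-48I0 must be torqued to 35 Nm before pressure testing."
query = "XK-4810 torqued"

print(f"CER: {cer(reference, ocr_output):.4f}")
print("hit on ground truth:", keyword_hit(query, reference))
print("hit on OCR output:  ", keyword_hit(query, ocr_output))
```

Dense retrievers are more forgiving than this exact-match toy, so the size of the effect in a real pipeline would differ; the example only shows that character-level scores do not weight errors by how much a token matters for retrieval.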
What carries the argument
An OCR-first RAG evaluation pipeline that measures end-to-end retrieval and answer quality on eleven industrial document categories instead of isolated character error rates.
If this is right
- Structural and semantic errors in OCR output produce retrieval failures even when character accuracy metrics look good.
- The performance gap between conventional OCR benchmarks and RAG effectiveness varies by document category.
- Both retrieval-side and generation-side failures contribute to the observed degradation.
- The mismatch remains stable across different representative OCR-first pipeline configurations.
Where Pith is reading between the lines
- RAG developers may need to optimize OCR components for layout preservation and semantic fidelity rather than character accuracy alone.
- Post-processing steps that repair structure and reading order could close part of the gap without changing the underlying OCR engine.
- Future OCR training objectives could incorporate retrieval or question-answering signals as additional supervision.
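As an illustration of the reading-order repair idea above, here is a minimal sketch that assumes the OCR engine returns text blocks with bounding-box coordinates and that the page has exactly two columns split at the page midpoint. The `Block` type and `reorder_two_column` function are hypothetical and not taken from the paper or from any OCR library; real layouts would need proper column detection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    x0: float  # left edge of the block
    y0: float  # top edge of the block (smaller means higher on the page)
    text: str


def reorder_two_column(blocks: List[Block], page_width: float) -> List[str]:
    """Read the left column top to bottom, then the right column.

    Assumes a strict two-column page split at the midpoint (a simplification).
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x0 < mid), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= mid), key=lambda b: b.y0)
    return [b.text for b in left + right]


# Blocks in raster order, as an OCR engine that ignores columns might emit them.
raster_order = [
    Block(50, 100, "L1"),
    Block(350, 100, "R1"),
    Block(50, 200, "L2"),
    Block(350, 200, "R2"),
]
print(reorder_two_column(raster_order, page_width=600))
```

The character content is identical before and after the reorder, which is why a step like this could change chunk boundaries and retrieval without changing the OCR engine itself.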
Load-bearing premise
That the chosen eleven document categories and the controlled OCR-first RAG pipeline sufficiently represent real industrial conditions, and that the observed performance gaps are driven primarily by OCR rather than by other pipeline components.
What would settle it
If an OCR model selected or fine-tuned to minimize the new benchmark's retrieval failures shows no improvement in RAG accuracy on a fresh set of industrial documents compared with models chosen solely by WER or CER, the claimed mismatch would be falsified.
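The test described above can be written down as a small comparison. This is a sketch under stated assumptions: the `ModelScores` fields, the model names, and every number are placeholders invented for illustration, not results from the paper.

```python
from typing import Dict, NamedTuple


class ModelScores(NamedTuple):
    cer: float              # character error rate, lower is better
    bench_recall: float     # retrieval recall on the benchmark, higher is better
    heldout_rag_acc: float  # RAG answer accuracy on fresh documents


def select_by_cer(scores: Dict[str, ModelScores]) -> str:
    """Pick the model a conventional OCR leaderboard would pick."""
    return min(scores, key=lambda m: scores[m].cer)


def select_by_benchmark(scores: Dict[str, ModelScores]) -> str:
    """Pick the model with the fewest retrieval failures on the benchmark."""
    return max(scores, key=lambda m: scores[m].bench_recall)


def heldout_gain(scores: Dict[str, ModelScores]) -> float:
    """Held-out RAG accuracy of the benchmark pick minus that of the CER pick."""
    by_bench = scores[select_by_benchmark(scores)].heldout_rag_acc
    by_cer = scores[select_by_cer(scores)].heldout_rag_acc
    return by_bench - by_cer


# Placeholder values only; they do not come from the paper.
toy_scores = {
    "model_a": ModelScores(cer=0.02, bench_recall=0.60, heldout_rag_acc=0.55),
    "model_b": ModelScores(cer=0.04, bench_recall=0.80, heldout_rag_acc=0.70),
}
print(select_by_cer(toy_scores), select_by_benchmark(toy_scores))
print(f"held-out gain: {heldout_gain(toy_scores):+.2f}")
```

A gain that stays near zero across fresh document sets would count against the claimed mismatch; a consistently positive gain would support it.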
Original abstract
Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InduOCRBench, a benchmark for OCR robustness in industrial RAG systems spanning 11 challenging document categories (extreme layouts, tables, formulas, historical documents, etc.). It evaluates SOTA OCR models in a controlled OCR-first RAG pipeline and claims that high conventional OCR accuracy (low WER/CER) does not guarantee strong downstream RAG performance, as structural and semantic errors lead to retrieval and generation failures in a category-dependent manner that is stable across pipeline variants.
Significance. If the central findings hold, the work is significant for shifting OCR evaluation from isolated character-level metrics toward task-specific robustness in RAG pipelines, which is relevant for industrial document processing. The public benchmark release and analysis of both retrieval-side and generation-side failure modes are positive contributions that could guide future OCR and RAG research.
major comments (1)
- [Experiments] The evaluation lacks a ground-truth text baseline: the RAG pipeline (retriever, chunker, embedder, generator) is never run on perfect transcriptions of the same pages. Without this delta, the observed performance gaps cannot be confidently attributed to OCR errors rather than the inherent properties of the 11 document categories (e.g., non-standard reading order, tables, formulas). This directly undermines the central claim that structural/semantic OCR errors cause substantial RAG failures even when WER/CER is low. (Evaluation / Experiments section describing the OCR-first pipeline and results.)
minor comments (2)
- [Benchmark Description] A table explicitly listing the 11 categories with representative examples and statistics (e.g., page counts, average complexity) would improve clarity and reproducibility.
- [Abstract] The abstract states the benchmark is publicly available; confirm that the GitHub repository includes the exact document images, ground-truth annotations if any, and pipeline code used for the reported results.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the absence of a ground-truth text baseline limits the strength of our attribution of RAG failures to OCR errors, and we will revise the manuscript to address this.
Point-by-point responses
- Referee: The evaluation lacks a ground-truth text baseline: the RAG pipeline (retriever, chunker, embedder, generator) is never run on perfect transcriptions of the same pages. Without this delta, the observed performance gaps cannot be confidently attributed to OCR errors rather than the inherent properties of the 11 document categories (e.g., non-standard reading order, tables, formulas). This directly undermines the central claim that structural/semantic OCR errors cause substantial RAG failures even when WER/CER is low. (Evaluation / Experiments section describing the OCR-first pipeline and results.)
  Authors: We agree that a ground-truth text baseline is necessary to isolate the contribution of OCR errors from document-inherent difficulties. In the revised manuscript we will add experiments that run the identical RAG pipeline on perfect transcriptions of the same pages. This will allow direct computation of performance deltas attributable to OCR imperfections. We will report these results in the Experiments section, update the relevant tables and figures, and revise the discussion to clarify how the new baseline supports our claims about structural and semantic errors. The addition will not change the core experimental design but will strengthen the causal attribution.
  Revision: yes
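The delta under discussion can be sketched in a few lines. This is an illustrative helper, not the authors' code: the function name, the category keys, and the scores are hypothetical placeholders, and the metric could be any per-category RAG score (for example retrieval recall or answer accuracy).

```python
from typing import Dict


def ocr_attributable_delta(
    gt_scores: Dict[str, float],
    ocr_scores: Dict[str, float],
) -> Dict[str, float]:
    """Per-category score on perfect transcriptions minus score on OCR output.

    Only categories present in both inputs are compared.
    """
    shared = gt_scores.keys() & ocr_scores.keys()
    return {cat: gt_scores[cat] - ocr_scores[cat] for cat in sorted(shared)}


# Placeholder scores only; they do not come from the paper.
gt = {"tables": 0.80, "formulas": 0.50}    # same pipeline, ground-truth text
ocr = {"tables": 0.60, "formulas": 0.45}   # same pipeline, OCR text

for category, delta in ocr_attributable_delta(gt, ocr).items():
    print(f"{category}: ground truth {gt[category]:.2f}, "
          f"OCR {ocr[category]:.2f}, delta {delta:+.2f}")
```

With numbers like these, a large delta (tables) would point at OCR as the cause, while a low ground-truth score with a small delta (formulas) would suggest the category is hard for the rest of the pipeline regardless of OCR quality.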
Circularity Check
Empirical benchmarking study with no derivations, fitted predictions, or self-referential steps
The paper is a pure empirical benchmark evaluation: it defines 11 document categories, runs SOTA OCR models through a fixed RAG pipeline, and measures retrieval/generation metrics. No equations, no parameters fitted to subsets and then called predictions, no self-citation chains justifying uniqueness or ansatzes, and no renaming of known results as new derivations. The central claim (high OCR accuracy does not guarantee RAG performance) is established by direct side-by-side measurements on the released benchmark; the evaluation chain is self-contained against external data and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: industrial RAG systems depend on OCR to process visual documents.