
arxiv: 2605.00911 · v1 · submitted 2026-04-29 · 💻 cs.CV

When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation

Pith reviewed 2026-05-09 20:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords OCR robustness · Retrieval-augmented generation · Document benchmark · Industrial documents · RAG pipeline · Structural errors · Semantic fidelity

The pith

High OCR accuracy does not ensure strong retrieval-augmented generation performance on realistic industrial documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard character-level OCR metrics such as word and character error rates fail to capture how well extracted text supports downstream retrieval and generation in RAG systems. It presents a new benchmark covering eleven document categories with extreme layouts, watermarks, tables, formulas, and non-standard reading orders. When recent OCR models are run through a controlled OCR-first RAG pipeline, retrieval failures rise sharply on these documents even though conventional benchmark scores remain high. The gaps arise from structural and semantic distortions rather than simple character mistakes and appear consistently across document types and pipeline variations.

Core claim

High OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. The mismatch between conventional OCR scores and RAG effectiveness is category-dependent, occurs on both retrieval and generation sides, and holds across representative OCR-first pipeline choices.
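To make the claim concrete, here is a minimal sketch of how a character error rate can stay at zero while meaning is lost. The page content, the strings, and the `~~` strikethrough notation are hypothetical, and the sketch assumes the reference transcription records visible characters only; it is not the paper's scoring code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical page: the first price is struck through in the image.
# Reference that records visible characters only (style not encoded).
plain_reference = "Unit price: 120 USD 95 USD"
# Reference that also encodes the strikethrough (assumed notation).
styled_reference = "Unit price: ~~120 USD~~ 95 USD"
# OCR output: every character correct, strikethrough silently dropped.
ocr_output = "Unit price: 120 USD 95 USD"

print(cer(plain_reference, ocr_output))             # 0.0
print(round(cer(styled_reference, ocr_output), 3))  # 0.133
```

Against the plain reference the OCR output is scored as perfect, yet a reader of the text alone can no longer tell which price is current.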

What carries the argument

An OCR-first RAG evaluation pipeline that measures end-to-end retrieval and answer quality on eleven industrial document categories instead of isolated character error rates.
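A toy version of such a pipeline can be sketched as follows. Everything here is an assumption made for illustration: the fixed-size word chunker, the token-overlap scorer standing in for a real retriever, the substring check for evidence, and the two-column example page. The paper's actual retriever, chunker, and metrics are not reproduced.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: number of query tokens also present in the passage."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def recall_at_k(samples, page_text: str, size: int, k: int) -> float:
    """Fraction of queries whose evidence string appears in a top-k chunk."""
    chunks = chunk(page_text, size)
    hits = 0
    for query, evidence in samples:
        ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
        hits += any(evidence.lower() in c.lower() for c in ranked[:k])
    return hits / len(samples)


# Hypothetical two-column page. Correct reading order: left column, then right.
ground_truth = "Pump A maximum pressure 40 bar. Pump B maximum pressure 25 bar."
# OCR that reads straight across both columns: same characters, wrong order.
ocr_text = "Pump A maximum Pump B maximum pressure 40 bar. pressure 25 bar."

samples = [("Pump B maximum pressure", "Pump B maximum pressure 25 bar")]

print(recall_at_k(samples, ground_truth, size=6, k=1))  # 1.0
print(recall_at_k(samples, ocr_text, size=6, k=1))      # 0.0
```

Both texts contain exactly the same words, so a character-level comparison of their contents would not flag the problem, but the evidence sentence no longer exists as a contiguous span for the retriever to return.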

If this is right

  • Structural and semantic errors in OCR output produce retrieval failures even when character accuracy metrics look good.
  • The performance gap between conventional OCR benchmarks and RAG effectiveness varies by document category.
  • Both retrieval-side and generation-side failures contribute to the observed degradation.
  • The mismatch remains stable across different representative OCR-first pipeline configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RAG developers may need to optimize OCR components for layout preservation and semantic fidelity rather than character accuracy alone.
  • Post-processing steps that repair structure and reading order could close part of the gap without changing the underlying OCR engine.
  • Future OCR training objectives could incorporate retrieval or question-answering signals as additional supervision.
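As one illustration of the post-processing idea in the list above, the sketch below restores reading order for a page that is already known to have two columns. The `Box` type, the coordinates, and the midline rule are hypothetical simplifications; real documents would need proper layout analysis, and nothing here comes from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float  # left edge of the text box
    y: float  # top edge of the text box


def reorder_two_columns(boxes: list[Box], page_width: float) -> str:
    """Read the left column top to bottom, then the right column.

    Assumes exactly two columns separated at the page midline.
    """
    mid = page_width / 2
    left = sorted((b for b in boxes if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in boxes if b.x >= mid), key=lambda b: b.y)
    return " ".join(b.text for b in left + right)


# Hypothetical OCR output in row-major order (reads across both columns).
boxes = [
    Box("Pump A maximum", x=0, y=0),
    Box("Pump B maximum", x=300, y=0),
    Box("pressure 40 bar.", x=0, y=20),
    Box("pressure 25 bar.", x=300, y=20),
]

print(reorder_two_columns(boxes, page_width=600))
# Pump A maximum pressure 40 bar. Pump B maximum pressure 25 bar.
```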

Load-bearing premise

That the chosen eleven document categories and the controlled OCR-first RAG pipeline sufficiently represent real industrial conditions, and that the observed performance gaps are driven primarily by OCR rather than by other pipeline components.

What would settle it

If an OCR model selected or fine-tuned to minimize the new benchmark's retrieval failures shows no improvement in RAG accuracy on a fresh set of industrial documents compared with models chosen solely by WER or CER, the claimed mismatch would be falsified.

Figures

Figures reproduced from arXiv: 2605.00911 by Change Jia, Jingang Huang, Linglin Zhang, Lin Sun, Wang Dexian, Xiangzheng Zhang, Zhengwei Cheng.

Figure 1: OCR strips strikethrough, turning a clear con…
Figure 2: Benchmarking OCR Robustness for RAG.
Figure 3: OCR accuracy versus RAG accuracy across document types. Four regimes emerge: OCR Reliable (high-high), LLM Compensates (low-high), Both Weak (low-low), and OCR Blind Spot (high-low). VisualStyle exemplifies the blind spot: 82.9% OCR accuracy yields only 53.0% RAG accuracy.
Figure 4: RAG accuracy (%) across OCR models and document challenges. Ground-Truth represents perfect OCR. Color indicates performance: green (high) to red (low).
Figure 5: InduOCRBench Document Domain Distribution.
Figure 6: Radar chart comparison of recall (left) and accuracy (right) performance across 11 document challenge…
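The four regimes named in the Figure 3 caption can be expressed as a simple two-threshold rule. The sketch below assumes a cut-off of 70% on both axes purely for illustration; the caption does not state where the paper draws the line between "high" and "low".

```python
def regime(ocr_acc: float, rag_acc: float,
           ocr_threshold: float = 70.0, rag_threshold: float = 70.0) -> str:
    """Map (OCR accuracy, RAG accuracy) in percent to one of four regimes.

    Both thresholds are assumed values, not taken from the paper.
    """
    high_ocr = ocr_acc >= ocr_threshold
    high_rag = rag_acc >= rag_threshold
    if high_ocr and high_rag:
        return "OCR Reliable"
    if not high_ocr and high_rag:
        return "LLM Compensates"
    if not high_ocr and not high_rag:
        return "Both Weak"
    return "OCR Blind Spot"


# VisualStyle figures quoted in the Figure 3 caption.
print(regime(82.9, 53.0))  # OCR Blind Spot
```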

Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces InduOCRBench, a benchmark for OCR robustness in industrial RAG systems spanning 11 challenging document categories (extreme layouts, tables, formulas, historical documents, etc.). It evaluates SOTA OCR models in a controlled OCR-first RAG pipeline and claims that high conventional OCR accuracy (low WER/CER) does not guarantee strong downstream RAG performance, as structural and semantic errors lead to retrieval and generation failures in a category-dependent manner that is stable across pipeline variants.

Significance. If the central findings hold, the work is significant for shifting OCR evaluation from isolated character-level metrics toward task-specific robustness in RAG pipelines, which is relevant for industrial document processing. The public benchmark release and analysis of both retrieval-side and generation-side failure modes are positive contributions that could guide future OCR and RAG research.

major comments (1)
  1. [Experiments] The evaluation lacks a ground-truth text baseline: the RAG pipeline (retriever, chunker, embedder, generator) is never run on perfect transcriptions of the same pages. Without this delta, the observed performance gaps cannot be confidently attributed to OCR errors rather than the inherent properties of the 11 document categories (e.g., non-standard reading order, tables, formulas). This directly undermines the central claim that structural/semantic OCR errors cause substantial RAG failures even when WER/CER is low. (Evaluation / Experiments section describing the OCR-first pipeline and results.)
minor comments (2)
  1. [Benchmark Description] A table explicitly listing the 11 categories with representative examples and statistics (e.g., page counts, average complexity) would improve clarity and reproducibility.
  2. [Abstract] The abstract states the benchmark is publicly available; confirm that the GitHub repository includes the exact document images, ground-truth annotations if any, and pipeline code used for the reported results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the absence of a ground-truth text baseline limits the strength of our attribution of RAG failures to OCR errors, and we will revise the manuscript to address this.

  1. Referee: The evaluation lacks a ground-truth text baseline: the RAG pipeline (retriever, chunker, embedder, generator) is never run on perfect transcriptions of the same pages. Without this delta, the observed performance gaps cannot be confidently attributed to OCR errors rather than the inherent properties of the 11 document categories (e.g., non-standard reading order, tables, formulas). This directly undermines the central claim that structural/semantic OCR errors cause substantial RAG failures even when WER/CER is low. (Evaluation / Experiments section describing the OCR-first pipeline and results.)

    Authors: We agree that a ground-truth text baseline is necessary to isolate the contribution of OCR errors from document-inherent difficulties. In the revised manuscript we will add experiments that run the identical RAG pipeline on perfect transcriptions of the same pages. This will allow direct computation of performance deltas attributable to OCR imperfections. We will report these results in the Experiments section, update the relevant tables and figures, and revise the discussion to clarify how the new baseline supports our claims about structural and semantic errors. The addition will not change the core experimental design but will strengthen the causal attribution.

    revision: yes
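The delta discussed in this exchange can be written down as a small calculation. The sketch below uses made-up accuracy numbers and hypothetical names (`CategoryResult`, `attribute_gap`); it only illustrates how a ground-truth run would split the total shortfall into a document-inherent part and an OCR-induced part.

```python
from dataclasses import dataclass


@dataclass
class CategoryResult:
    name: str
    rag_acc_ground_truth: float  # RAG accuracy (%) on perfect transcriptions
    rag_acc_ocr: float           # RAG accuracy (%) on OCR output


def attribute_gap(result: CategoryResult, ceiling: float = 100.0) -> dict[str, float]:
    """Split the shortfall from `ceiling` into two additive parts."""
    return {
        # Lost even with perfect text: difficulty of the category or pipeline.
        "inherent": ceiling - result.rag_acc_ground_truth,
        # Lost only when OCR output replaces the perfect text.
        "ocr_induced": result.rag_acc_ground_truth - result.rag_acc_ocr,
    }


# Made-up numbers for a hypothetical category.
example = CategoryResult("ExampleCategory", rag_acc_ground_truth=85.0, rag_acc_ocr=60.0)
print(attribute_gap(example))  # {'inherent': 15.0, 'ocr_induced': 25.0}
```

Only the `ocr_induced` term supports a claim about OCR errors; a large `inherent` term would point to the document category or the rest of the pipeline instead.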

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations, fitted predictions, or self-referential steps


The paper is a pure empirical benchmark evaluation: it defines 11 document categories, runs SOTA OCR models through a fixed RAG pipeline, and measures retrieval/generation metrics. No equations, no parameters fitted to subsets and then called predictions, no self-citation chains justifying uniqueness or ansatzes, and no renaming of known results as new derivations. The central claim (high OCR accuracy does not guarantee RAG performance) is established by direct side-by-side measurements on the released benchmark; the evaluation chain is self-contained against external data and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that the selected document types capture industrial OCR challenges and that the pipeline isolates OCR effects from other variables.

axioms (1)
  • domain assumption: Industrial RAG systems depend on OCR to process visual documents
    Directly stated in the opening of the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1112 out tokens · 51464 ms · 2026-05-09T20:00:29.576240+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1] PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System. 2021.
  2. [2] PaddleOCR 3.0 Technical Report. 2025.
  3. [3] MinerU: An Open-Source Solution for Precise Document Content Extraction. 2024.
  4. [4] Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation. 2023.
  5. [5] Notes on Applicability of GPT-4 to Document Understanding. 2024.
  6. [6] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. 2024.
  7. [7] DeepSeek-OCR: Contexts Optical Compression. 2025.
  8. [8] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model. 2025.
  9. [9] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, Xiang Bai. OCRBench: on the hidden mystery of OCR in large multimodal models. Science China Information Sciences. doi:10.1007/s11432-024-4235-6
  10. [10] OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. 2025.
  11. [11] Avinash Anand, Raj Jaiswal, Pijush Bhuyan, Mohit Gupta, Siddhesh Bangar, Md. Modassir Imam, Rajiv Ratn Shah, Shin’ichi Satoh. TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content. doi:10.1145/3606040.3617444
  12. [12] Deep learning for table detection and structure recognition: A survey. 2022.
  13. [13] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.
  14. [14] Ragas: Automated Evaluation of Retrieval Augmented Generation. 2025.
  15. [15] ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. 2024.
  16. [16] OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation. 2025.
  17. [17] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing. 2025.
  18. [18] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model. 2025.
  19. [19] HunyuanOCR Technical Report. 2025.
  20. [20] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025.
  21. [21] Qwen3-VL Technical Report. ArXiv.
  22. [22] GPT-4o System Card. 2024.
  23. [23] gpt-oss-120b&gpt-oss-20b Model Card. 2025.
  24. [24] Jiajie Jin, Yutao Zhu, Zhicheng Dou, Guanting Dong, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, Ji… FlashRAG: … Companion Proceedings of the … 2025.
  25. [25] Making Large Language Models A Better Foundation For Dense Retrieval. 2023.
  26. [26] BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 2024.