arxiv: 2604.10077 · v2 · pith:ETZRMLNPnew · submitted 2026-04-11 · 💻 cs.CV

DocRevive: A Unified Pipeline for Document Text Restoration

Pith reviewed 2026-05-25 06:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords document restoration · text inpainting · OCR · diffusion models · document understanding · synthetic dataset · UCSM metric · occlusion detection

The pith

DocRevive combines OCR, masked language modeling, and diffusion models into a pipeline that restores text in damaged documents while preserving visual integrity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DocRevive as a unified pipeline that first applies OCR to detect and recognize text, then uses an occlusion detector to locate degraded regions, followed by masked language modeling to generate semantically coherent replacements and a diffusion-based module to reintegrate the text with matching font, size, and alignment. This addresses the problem of reconstructing damaged, occluded, or incomplete text, which can improve performance on later document understanding tasks. The authors also release a synthetic dataset of 30,078 degraded document images as a benchmark and introduce the Unified Context Similarity Metric to score restorations on edit, semantic, length, and contextual grounds.

Core claim

A pipeline that sequences state-of-the-art OCR for text detection and recognition, an occlusion detector for identifying degradation, inpainting models for semantically coherent reconstruction, and a diffusion-based module for seamless text reintegration can restore and reconstruct text in degraded documents while preserving visual integrity, as shown on the new OPRB synthetic dataset.

What carries the argument

The DocRevive unified pipeline that sequences OCR, occlusion detection, masked language modeling, inpainting, and diffusion-based text reintegration.
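
As an illustrative sketch only (not the authors' code), the staged data flow can be written as a chain of functions that each extend a shared page state. Every name here (`PageState`, `run_pipeline`, the `fake_*` stages) is hypothetical, the stand-in stages return hard-coded values instead of calling real models, and the inpainting and diffusion steps are collapsed into a single renderer for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class PageState:
    """Everything the stages pass along; all field names are hypothetical."""
    image: object                                          # degraded page image
    words: List[Word] = field(default_factory=list)        # OCR output
    occlusions: List[Box] = field(default_factory=list)    # detector output
    predictions: List[str] = field(default_factory=list)   # one string per occlusion
    restored: object = None                                # final rendered page


Stage = Callable[[PageState], PageState]


def run_pipeline(state: PageState, stages: List[Stage]) -> PageState:
    """Apply the stages in order; each one reads and extends the state."""
    for stage in stages:
        state = stage(state)
    return state


# Stand-in stages so the sketch runs without any model weights.
def fake_ocr(s: PageState) -> PageState:
    s.words = [Word("The", (0, 0, 30, 10)), Word("fox", (80, 0, 110, 10))]
    return s


def fake_occlusion_detector(s: PageState) -> PageState:
    s.occlusions = [(35, 0, 75, 10)]  # the gap between "The" and "fox"
    return s


def fake_language_model(s: PageState) -> PageState:
    s.predictions = ["quick" for _ in s.occlusions]
    return s


def fake_renderer(s: PageState) -> PageState:
    # A real renderer would inpaint the background and draw the predicted text.
    s.restored = {"image": s.image, "inserted": list(zip(s.occlusions, s.predictions))}
    return s


result = run_pipeline(
    PageState(image="page.png"),
    [fake_ocr, fake_occlusion_detector, fake_language_model, fake_renderer],
)
```

Keeping each stage behind the same `PageState -> PageState` signature is one way to make the per-stage inputs and outputs explicit, and it lets any single stage be swapped or tested in isolation.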

If this is right

  • Restored documents improve accuracy on downstream tasks such as information extraction and layout analysis.
  • The OPRB dataset functions as a standard benchmark for comparing future text restoration methods.
  • The UCSM metric supplies a finer-grained evaluation than edit distance alone by adding contextual predictability.
  • Archival and preservation workflows gain an automated step for handling damaged pages.
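
To make the point about edit distance concrete, the sketch below computes a normalized edit similarity and shows how it can rank a near-miss spelling above a reasonable synonym. The function names and the normalization by the longer string's length are assumptions made for this illustration; the paper's own edit term may be defined differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, target: str) -> float:
    """1.0 for identical strings, lower as more edits are needed (assumed normalization)."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, target) / longest


target = "large"
near_miss = edit_similarity("barge", target)  # one substitution, wrong meaning
synonym = edit_similarity("huge", target)     # three edits, close meaning
```

Here `near_miss` comes out higher than `synonym` even though "huge" is the more acceptable restoration of "large", which is the kind of case a metric with semantic and contextual terms is meant to handle.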

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular design suggests the pipeline could be inserted into existing document processing systems without major redesign.
  • Performance gaps between synthetic and authentic historical documents would highlight the need for more varied training data.
  • The same sequence of detection-plus-diffusion steps might apply to restoring scene text in natural photographs.

Load-bearing premise

The synthetic dataset of 30,078 degraded document images simulates real-world document degradation closely enough to serve as a reliable benchmark for restoration performance.

What would settle it

If real-world degraded documents produce substantially lower UCSM scores or visibly mismatched text than the synthetic test results, the pipeline's claimed effectiveness would not hold.

Figures

Figures reproduced from arXiv: 2604.10077 by Ayan Banerjee, Josep Llados, Kunal Purkayastha, Umapada Pal.

Figure 1. Feature completeness across document datasets: OPRB is the only benchmark that jointly provides degradation-aware labels, paired clean/degraded pages, word-level supervision, and restoration-oriented structure, making it more suitable for document restoration than standard layout-only benchmarks.
Figure 2. Sample visualization of occluding patches over a docu…
Figure 3. t-SNE comparison across document datasets: OPRB occupies both distinct and shared regions in the shared feature space, indicating that it captures degradation patterns and restoration-relevant document conditions not represented by standard clean-layout benchmarks.
Figure 4. PCA variance across datasets: OPRB exhibits broad variation in its document appearance and degradation patterns, which is important for evaluating restoration methods under diverse conditions.
Figure 5. DocRevive: a unified model architecture for the proposed document restoration pipeline. Given a degraded document image, the framework first performs OCR, followed by occlusion-aware blank region extraction to localize missing text areas. The surrounding word context is then grouped and converted into formatted prompt tokens, which are processed by the RoBERTa masked language model to predict the missing c…
Figure 6. Visualization of the proposed framework on a real scanned document degraded using whiteboard marker ink. The document…
Figure 7. Visualization of the proposed framework on a few OPRB document images for different types of occlusions. [Zoom in for better…]
Figure 8. Visualization of the state-of-the-art industry models on a few OPRB document images for different types of occlusions. [Zoom in…]
Figure 9. Graphical visualization of effectiveness of UCSM over Edit Distance [Zoom in for better visualization]
Figure 10. Qualitative comparison of document restoration methods across occlusion types.
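
The Figure 5 caption describes grouping the words around a blank region into a prompt for a masked language model. A minimal sketch of that step, under stated assumptions, is given here: the function name, the single-line layout, the context window, and the use of exactly one mask token per blank are all hypothetical simplifications, and only the `<mask>` token string follows RoBERTa's convention.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def build_masked_prompt(words: List[Tuple[str, Box]], blank: Box,
                        window: int = 5, mask_token: str = "<mask>") -> str:
    """Place mask_token where the blank region sits among same-line OCR words."""
    ordered = sorted(words, key=lambda w: w[1][0])  # left to right by x0
    blank_center = (blank[0] + blank[2]) / 2

    # Index of the first word whose centre lies to the right of the blank.
    split = len(ordered)
    for i, (_, box) in enumerate(ordered):
        if (box[0] + box[2]) / 2 > blank_center:
            split = i
            break

    # Keep at most `window` words of context on each side of the blank.
    left = [text for text, _ in ordered[max(0, split - window):split]]
    right = [text for text, _ in ordered[split:split + window]]
    return " ".join(left + [mask_token] + right)


line_words = [
    ("The", (0, 0, 30, 10)),
    ("brown", (80, 0, 130, 10)),
    ("fox", (135, 0, 160, 10)),
]
prompt = build_masked_prompt(line_words, blank=(35, 0, 75, 10))
```

In practice a missing word may span several subword tokens and a blank may cross line boundaries, so a real prompt builder would need more than this single-mask, single-line version.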

Original abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30{,}078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents DocRevive, a unified pipeline for restoring damaged or occluded text in documents. It combines state-of-the-art OCR, image analysis for occlusion detection, masked language modeling for semantic coherence, and diffusion-based inpainting to reconstruct text while preserving visual properties such as font, size, and alignment. The work introduces the synthetic OPRB dataset of 30,078 degraded document images to benchmark restoration tasks and proposes the Unified Context Similarity Metric (UCSM), which combines edit, semantic, and length similarities with a contextual predictability penalty. The pipeline and dataset are claimed to set a new standard for document text reconstruction, with code and data released publicly.

Significance. If the central claims hold, the work could advance document restoration for archival and digital preservation applications by providing an integrated pipeline and an open benchmark. The public release of the OPRB dataset and code is a clear strength that enables reproducibility and follow-on research. However, the significance is currently limited by the lack of demonstrated fidelity between the synthetic degradations and real-world document conditions.

major comments (2)
  1. [Abstract / OPRB dataset section] The central claim that the pipeline 'sets a new standard' rests on quantitative results obtained exclusively on the synthetic OPRB dataset (Abstract and dataset description). No validation is reported that the generated degradations match the statistics of real archival documents (e.g., edge histograms, bleed-through distributions, or OCR error rates under non-uniform lighting). Without such evidence the benchmark results remain dataset-specific and do not yet support field-wide conclusions.
  2. [Evaluation / UCSM definition] The UCSM metric is presented as incorporating 'contextual predictability' to penalize deviations when text is obvious from context, yet no ablation or sensitivity analysis is shown for the weighting of its edit, semantic, length, and predictability components. This makes it difficult to assess whether UCSM provides a robust, non-circular evaluation of restoration quality beyond standard metrics.
minor comments (2)
  1. [Abstract] The abstract states the dataset size as '30{,}078' with an unusual comma placement; standardize to 30,078 throughout.
  2. [Pipeline overview] The pipeline diagram and component descriptions would benefit from explicit input/output specifications for each stage (OCR → occlusion detector → MLM → diffusion) to clarify data flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions.

  1. Referee: [Abstract / OPRB dataset section] The central claim that the pipeline 'sets a new standard' rests on quantitative results obtained exclusively on the synthetic OPRB dataset (Abstract and dataset description). No validation is reported that the generated degradations match the statistics of real archival documents (e.g., edge histograms, bleed-through distributions, or OCR error rates under non-uniform lighting). Without such evidence the benchmark results remain dataset-specific and do not yet support field-wide conclusions.

    Authors: We acknowledge that the manuscript reports no quantitative statistical validation (such as edge histograms or bleed-through distributions) comparing the synthetic degradations to real archival documents. The OPRB dataset was designed to simulate representative degradation types drawn from the document analysis literature, but this design rationale is not accompanied by direct empirical matching in the current text. In revision we will add a subsection to the dataset description that details the degradation generation process, cites supporting references for the chosen degradation models, and explicitly notes the absence of real-world statistical validation as a limitation. We will also revise the abstract and conclusion to state that the results establish performance on a controlled synthetic benchmark rather than claiming a field-wide standard. revision: partial

  2. Referee: [Evaluation / UCSM definition] The UCSM metric is presented as incorporating 'contextual predictability' to penalize deviations when text is obvious from context, yet no ablation or sensitivity analysis is shown for the weighting of its edit, semantic, length, and predictability components. This makes it difficult to assess whether UCSM provides a robust, non-circular evaluation of restoration quality beyond standard metrics.

    Authors: The observation is correct: the manuscript defines UCSM as a linear combination of edit, semantic, length, and contextual predictability terms but provides no ablation or sensitivity study on the relative weights. In the revised manuscript we will add an ablation subsection to the evaluation that systematically varies each component weight, reports the resulting metric values on the OPRB test set, and compares against standard metrics (CER, semantic similarity) to demonstrate that the combined score is not circular and offers additional diagnostic value. revision: yes
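
The kind of weight-sensitivity check discussed in this exchange can be sketched as follows. This is an assumption-laden illustration, not the paper's UCSM: the component names, the weighted-mean form, the treatment of contextual predictability as a score in [0, 1] rather than a penalty, and the two sets of component scores are all placeholders invented for the example.

```python
from itertools import product
from typing import Dict

COMPONENTS = ("edit", "semantic", "length", "context")


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of component scores, each assumed to lie in [0, 1]."""
    total = sum(weights[c] for c in COMPONENTS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * scores[c] for c in COMPONENTS) / total


def preference_counts(a: Dict[str, float], b: Dict[str, float],
                      grid=(0.0, 0.5, 1.0)) -> Dict[str, int]:
    """Count the weight settings under which restoration A or B scores higher."""
    counts = {"a": 0, "b": 0, "tie": 0}
    for combo in product(grid, repeat=len(COMPONENTS)):
        if sum(combo) == 0:
            continue  # skip the all-zero setting, which has no defined score
        w = dict(zip(COMPONENTS, combo))
        sa, sb = combined_score(a, w), combined_score(b, w)
        key = "tie" if abs(sa - sb) < 1e-12 else ("a" if sa > sb else "b")
        counts[key] += 1
    return counts


# Placeholder component scores for two candidate restorations (not real results).
restoration_a = {"edit": 0.8, "semantic": 0.2, "length": 1.0, "context": 0.3}
restoration_b = {"edit": 0.4, "semantic": 0.9, "length": 0.8, "context": 0.9}

counts = preference_counts(restoration_a, restoration_b)
```

If the preferred restoration changes across a large share of weight settings, the ranking is being driven by the weighting rather than by the restorations themselves, which is the sensitivity the referee asks the authors to report.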

Circularity Check

0 steps flagged

No circularity in pipeline or metric proposal


The manuscript describes an engineering pipeline (OCR + occlusion detection + MLM + diffusion inpainting) and introduces UCSM as a composite metric without any equations, fitted parameters, or predictions that reduce to their own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and the synthetic OPRB dataset is presented as a created benchmark rather than a self-referential evaluation. The central claims rest on external model components and a new metric definition that does not tautologically reproduce its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that synthetic degradations are representative and that the staged pipeline produces semantically coherent and visually consistent output; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Synthetic data of 30,078 images can serve as a valid benchmark for real document restoration tasks
    The paper states it creates this dataset to set a benchmark for restoration tasks.

pith-pipeline@v0.9.0 · 5776 in / 1276 out tokens · 23453 ms · 2026-05-25T06:36:08.310929+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 5 internal anchors

  1. [1]

    Restoring and attributing ancient texts using deep neural networks. Nature, 603(7900):280–283, 2022

    Yannis Assael, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. Restoring and attributing ancient texts using deep neural networks. Nature, 603(7900):280–283, 2022. 3

  2. [2]

    TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

    Ayan Banerjee, Josep Lladós, Umapada Pal, and Anjan Dutta. Talediffusion: Multi-character story generation with dialogue rendering. arXiv preprint arXiv:2509.04123, 2025. 3

  3. [3]

    CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models

    Ayan Banerjee, Fernando Vilariño, and Josep Lladós. Craftgraffiti: Exploring human identity with custom graffiti art via facial-preserving diffusion models. arXiv preprint arXiv:2508.20640, 2025

  4. [4]

    Craftsvg: Multi-object text-to-svg synthesis via layout guided diffusion

    Ayan Banerjee, Nityanand Mathur, Josep Llados, Umapada Pal, and Anjan Dutta. Craftsvg: Multi-object text-to-svg synthesis via layout guided diffusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2564–2574, 2026. 3

  5. [5]

    Scene text recognition with permuted autoregressive sequence models

    Darwin Bautista and Rowel Atienza. Scene text recognition with permuted autoregressive sequence models. In European conference on computer vision, pages 178–196. Springer,

  6. [6]

    Mending fractured texts

    Jens Bjerring-Hansen, Ross Deans Kristensen-McLachlan, Philip Diderichsen, and Dorte Haltrup Hansen. Mending fractured texts. a heuristic procedure for correcting ocr data. In CEUR Workshop Proceedings, pages 177–186. CEUR Workshop Proceedings, 2022. 2

  7. [7]

    Selecting a restoration technique to minimize ocr error. IEEE Transactions on Neural Networks, 14(3):478–490, 2003

    Mike Cannon, Mike Fugate, Don R Hush, and Clint Scovel. Selecting a restoration technique to minimize ocr error. IEEE Transactions on Neural Networks, 14(3):478–490, 2003. 1

  8. [8]

    Linknet: Exploiting encoder representations for efficient semantic segmentation

    Abhishek Chaurasia and Eugenio Culurciello. Linknet: Exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE visual communications and image processing (VCIP), pages 1–4. IEEE, 2017. 8

  9. [9]

    Textdiffuser: Diffusion models as text painters

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. In Advances in Neural Information Processing Systems (NeurIPS) 36, 2023. 3

  10. [10]

    Simple baselines for image restoration

    Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European conference on computer vision, pages 17–33. Springer, 2022. 6

  11. [11]

    Fast: Faster arbitrarily-shaped text detector with minimalist kernel representation. arXiv preprint arXiv:2111.02394, 2021

    Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, and Tong Lu. Fast: Faster arbitrarily-shaped text detector with minimalist kernel representation. arXiv preprint arXiv:2111.02394, 2021. 4, 8

  12. [12]

    Pp-ocr: A practical ultra lightweight ocr system

    Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020. 2

  13. [13]

    Context perception parallel decoder for scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,

    Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, and Yu-Gang Jiang. Context perception parallel decoder for scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,

  14. [14]

    Restoration of fragmentary babylonian texts using recurrent neural networks. Proceedings of the National Academy of Sciences (PNAS), 117(37):22743–22751, 2020

    Ethan Fetaya, Yonatan Lifshitz, Elad Aaron, and Shai Gordin. Restoration of fragmentary babylonian texts using recurrent neural networks. Proceedings of the National Academy of Sciences (PNAS), 117(37):22743–22751, 2020. 3

  15. [15]

    Unsupervised post-ocr correction for noisy text in engineering documents

    Mathieu François and Véronique Eglin. Unsupervised post-ocr correction for noisy text in engineering documents. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), 2023. 3

  16. [16]

    Advancing post-ocr correction: A comparative study of synthetic data. arXiv preprint arXiv:2408.02253, 2024

    Shuhao Guan and Derek Greene. Advancing post-ocr correction: A comparative study of synthetic data. arXiv preprint arXiv:2408.02253, 2024. 3

  17. [17]

    Self-supervised implicit glyph attention for text recognition

    Tongkun Guan, Chaochen Gu, Jingzheng Tu, Xue Yang, Qi Feng, Yudi Zhao, and Wei Shen. Self-supervised implicit glyph attention for text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15285–15294, 2023. 3

  18. [18]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024. 7

  19. [19]

    Docbank: A benchmark dataset for document layout analysis

    Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 949–960, 2020. 2, 5

  20. [20]

    TrOCR: Transformer-based optical character recognition with pre-trained models

    Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based optical character recognition with pre-trained models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13094–13102,

  21. [21]

    Real-time scene text detection with differentiable binarization

    Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI conference on artificial intelligence, pages 11474–11481, 2020. 8

  22. [22]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. 2, 5, 7

  23. [23]

    A new context-based method for restoring occluded text in natural scene images

    Ayush Mittal, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein, and Daniel Lopresti. A new context-based method for restoring occluded text in natural scene images. In Document Analysis Systems: 14th IAPR International Workshop, DAS 2020, Wuhan, China, July 26–29, 2020, Proceedings 14, pages 466–480. Springer, 2020. 2

  24. [24]

    S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029–1058, 1992. 3

  25. [25]

    Robust ocr of degraded documents

    Premkumar Natarajan, Issam Bazzi, Zhidong Lu, John Makhoul, and Richard Schwartz. Robust ocr of degraded documents. In Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR’99 (Cat. No. PR00318), pages 357–361. IEEE, 1999. 1

  26. [26]

    N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979. 2

  27. [27]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

  28. [28]

    Layereddoc: Domain adaptive document restoration with a layer separation approach

    Maria Pilligua, Nil Biescas, Javier Vazquez-Corral, Josep Lladós, Ernest Valveny, and Sanket Biswas. Layereddoc: Domain adaptive document restoration with a layer separation approach. In International Conference on Document Analysis and Recognition, pages 27–39. Springer, 2024. 2

  29. [29]

    Datr: Domain agnostic text recognizer

    Kunal Purkayastha, Shashwat Sarkar, Shivakumara Palaiahnakote, Umapada Pal, and Palash Ghosal. Datr: Domain agnostic text recognizer. In International Conference on Pattern Recognition, pages 220–235. Springer, 2025. 8

  30. [30]

    Yolo26: Key architectural enhancements and performance benchmarking for real-time object detection

    R Sapkota, RH Cheppally, A Sharda, and M Karkee. Yolo26: Key architectural enhancements and performance benchmarking for real-time object detection. arXiv preprint arXiv:2509.25164, 2025. 7

  31. [31]

    Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017. 3

  32. [32]

    Type-r: Automatically retouching typos for text-to-image generation

    Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seiichi Uchida, and Kota Yamaguchi. Type-r: Automatically retouching typos for text-to-image generation. arXiv preprint arXiv:2411.18159, 2024. 3

  33. [33]

    De-GAN: a conditional generative adversarial network for document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1180–1191, 2020

    Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. De-GAN: a conditional generative adversarial network for document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1180–1191, 2020. 2

  34. [34]

    Docentr: An end-to-end document image enhancement transformer

    Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. Docentr: An end-to-end document image enhancement transformer. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 1699–1705. IEEE, 2022. 2

  35. [35]

    Text-DIAE: A self-supervised degradation invariant autoencoder for text recognition and document enhancement

    Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gómez, and Dimosthenis Karatzas. Text-DIAE: A self-supervised degradation invariant autoencoder for text recognition and document enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023. 2

  36. [36]

    B. Su, S. Lu, and C. L. Tan. Robust document image binarization technique for degraded document images. IEEE Transactions on Image Processing, 22(4):1408–1417, 2013. 2

  37. [37]

    Unifying vision, text, and layout for universal document processing

    Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19254–19264, 2023. 3

  38. [38]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 7

  39. [39]

    Leveraging LLMs for post-ocr correction of historical newspapers

    Alan Thomas, Robert Gaizauskas, and Haiping Lu. Leveraging LLMs for post-ocr correction of historical newspapers. In Proceedings of the LT4HALA Workshop at LREC-COLING, pages 116–121, 2024. 3

  40. [40]

    Yolov10: Real-time end-to-end object detection

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458,

  41. [41]

    Yolov9: Learning what you want to learn using programmable gradient information

    Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024. 2, 4, 7

  42. [42]

    Symmetrical linguistic feature distillation with clip for scene text recognition

    Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, and Yongdong Zhang. Symmetrical linguistic feature distillation with clip for scene text recognition. In Proceedings of the 31st ACM international conference on multimedia, pages 509–518, 2023. 8

  43. [43]

    Ote: exploring accurate scene text recognition using one token

    Jianjun Xu, Yuxin Wang, Hongtao Xie, and Yongdong Zhang. Ote: exploring accurate scene text recognition using one token. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28327–28336, 2024. 1

  44. [44]

    DocDiff: Document enhancement via residual diffusion models

    Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. DocDiff: Document enhancement via residual diffusion models. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), pages 2795–2806,

  45. [45]

    Docdiff: Document enhancement via residual diffusion models

    Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. Docdiff: Document enhancement via residual diffusion models. In Proceedings of the 31st ACM international conference on multimedia, pages 2795–2806, 2023. 6

  46. [46]

    What is yolov8: an in-depth exploration of the internal features of the next-generation object detector (2024). Accessed: Sep 10, 2025

    Muhammad Yaseen. What is yolov8: an in-depth exploration of the internal features of the next-generation object detector (2024). Accessed: Sep 10, 2025. 7

  47. [47]

    DocReal: Robust document dewarping of real-life images via attention-enhanced control point prediction

    Fangchen Yu, Yina Xie, Lei Wu, Yafei Wen, Guozhi Wang, Shuai Ren, Xiaoxin Chen, Jianfeng Mao, and Wenye Li. DocReal: Robust document dewarping of real-life images via attention-enhanced control point prediction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 665–674, 2024. 2

  48. [48]

    A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007

    Li Yujian and Liu Bo. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007. 2

  49. [49]

    Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2025

    Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, and Yu Zhou. Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2025. 3, 5

  50. [50]

    Linguistic more: Taking a further step toward efficient and accurate scene text recognition

    Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, and Yongdong Zhang. Linguistic more: Taking a further step toward efficient and accurate scene text recognition. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), pages 1704–1712, 2023. 3

  51. [51]

    Choose what you need: Disentangled representation learning for scene text recognition removal and editing

    Boqiang Zhang, Hongtao Xie, Zuan Gao, and Yuxin Wang. Choose what you need: Disentangled representation learning for scene text recognition removal and editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28358–28368, 2024. 1

  52. [52]

    DocRes: A generalist model toward unifying document image restoration tasks

    Jiaxin Zhang, Dezhi Peng, Chongyu Liu, Peirong Zhang, and Lianwen Jin. DocRes: A generalist model toward unifying document image restoration tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  53. [53]

    Document image shadow removal guided by color-aware background

    Ling Zhang, Yinxiao He, Qing Zhang, Zheng Liu, Xiaolong Zhang, and Chunxia Xiao. Document image shadow removal guided by color-aware background. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1818–1827, 2023. 3

  54. [54]

    A review of document image enhancement based on document degradation problem. Applied Sciences, 13(13):7855, 2023

    Yanxi Zhou, Shikai Zuo, Zhengxian Yang, Jinlong He, Jianwen Shi, and Rui Zhang. A review of document image enhancement based on document degradation problem. Applied Sciences, 13(13):7855, 2023. 1

  55. [55]

    Text image inpainting via global structure-guided diffusion models

    Shipeng Zhu, Pengfei Fang, Chenjie Zhu, Zuoyan Zhao, Qiang Xu, and Hui Xue. Text image inpainting via global structure-guided diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7775–7783, 2024. 3, 5

    DocRevive: A Unified Pipeline for Document Text Restoration. Supplementary Material

  56. [56]

    In the current generator, we choose N unique source pages per class-level

    Dataset Construction Details. This supplementary section provides the full construction details of the Occluded Pages Restoration Benchmark (OPRB). In the current generator, we choose N unique source pages per class-level. We introduce a novel benchmark dataset called Occluded Pages Restoration Benchmark (OPRB) designed to evaluate document restoration u...

  57. [57]

    We evaluate the on three benchmark datasets

    Method Details. 10.1. Occlusion Detection and Blank Region Extraction. Occlusion patches are first localized using a fine-tuned YOLOv9c detector [41] trained on the OPRB dataset. The benchmark contains six degradation classes: Black Ink, Burnt, Whitener, Dust, Scribble, and Stamp. Opaque classes (Black Ink, Burnt, Whitener) fully obscure the underlying text, t...

  58. [58]

    We evaluate the on three benchmark datasets

    Miscellaneous Experiments. 11.1. Comparison with Prior Document Restoration Methods. We compare DocRevive against three prior methods on a subset of 498 images from OPRB (83 per occlusion type): DocDiff [45]; GSDM (standalone), our pipeline’s inpainting module run in isolation without any text prediction or editing; and NAFNet [10], a strong general image ...