DocRevive: A Unified Pipeline for Document Text Restoration

Ayan Banerjee; Josep Llados; Kunal Purkayastha; Umapada Pal

arxiv: 2604.10077 · v2 · pith:ETZRMLNPnew · submitted 2026-04-11 · 💻 cs.CV

Pith reviewed 2026-05-25 06:36 UTC · model grok-4.3

keywords: document restoration, text inpainting, OCR, diffusion models, document understanding, synthetic dataset, UCSM metric, occlusion detection

The pith

DocRevive combines OCR, masked language modeling, and diffusion models into a pipeline that restores text in damaged documents while preserving visual integrity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DocRevive as a unified pipeline that first applies OCR to detect and recognize text, then uses an occlusion detector to locate degraded regions, followed by masked language modeling to generate semantically coherent replacements and a diffusion-based module to reintegrate the text with matching font, size, and alignment. This addresses the problem of reconstructing damaged, occluded, or incomplete text, which can improve performance on later document understanding tasks. The authors also release a synthetic dataset of 30,078 degraded document images as a benchmark and introduce the Unified Context Similarity Metric to score restorations on edit, semantic, length, and contextual grounds.

Core claim

A pipeline that sequences state-of-the-art OCR for text detection and recognition, an occlusion detector for identifying degradation, inpainting models for semantically coherent reconstruction, and a diffusion-based module for seamless text reintegration can restore and reconstruct text in degraded documents while preserving visual integrity, as shown on the new OPRB synthetic dataset.

What carries the argument

The DocRevive unified pipeline that sequences OCR, occlusion detection, masked language modeling, inpainting, and diffusion-based text reintegration.
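The stage ordering can be sketched as a typed data flow. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Word`, `Gap`, `Page`, `restore`, and the four stage callables) is hypothetical, and the stages are passed in as plain functions so that only the inputs and outputs of each step are shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; a hypothetical convention


@dataclass
class Word:
    text: str  # recognized string from the OCR stage
    box: Box   # where the word sits on the page


@dataclass
class Gap:
    box: Box        # region flagged as occluded or degraded
    text: str = ""  # filled in later by the language-model stage


@dataclass
class Page:
    words: List[Word]
    gaps: List[Gap]


def restore(
    image: Any,
    run_ocr: Callable[[Any], List[Word]],
    detect_occlusions: Callable[[Any, List[Word]], List[Gap]],
    predict_text: Callable[[List[Word], Gap], str],
    render_text: Callable[[Any, Gap], Any],
) -> Tuple[Any, Page]:
    """Run the four stages in the order the summary describes."""
    words = run_ocr(image)                  # 1. detect and recognize visible text
    gaps = detect_occlusions(image, words)  # 2. locate degraded regions
    for gap in gaps:
        gap.text = predict_text(words, gap)  # 3. propose replacement text from context
    for gap in gaps:
        image = render_text(image, gap)      # 4. draw the text back into the image
    return image, Page(words, gaps)
```

Keeping each stage behind a callable is one way to make the data flow explicit: any OCR engine, detector, language model, or renderer with a matching signature could be swapped in without touching the driver.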

If this is right

Restored documents improve accuracy on downstream tasks such as information extraction and layout analysis.
The OPRB dataset functions as a standard benchmark for comparing future text restoration methods.
The UCSM metric supplies a finer-grained evaluation than edit distance alone by adding contextual predictability.
Archival and preservation workflows gain an automated step for handling damaged pages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular design suggests the pipeline could be inserted into existing document processing systems without major redesign.
Performance gaps between synthetic and authentic historical documents would highlight the need for more varied training data.
The same sequence of detection-plus-diffusion steps might apply to restoring scene text in natural photographs.

Load-bearing premise

The synthetic dataset of 30,078 degraded document images accurately simulates diverse real-world document degradation scenarios sufficiently to serve as a reliable benchmark for restoration performance.

What would settle it

If real-world degraded documents produce substantially lower UCSM scores or visibly mismatched text than the synthetic test results, the pipeline's claimed effectiveness would not hold.

Figures

Figures reproduced from arXiv: 2604.10077 by Ayan Banerjee, Josep Llados, Kunal Purkayastha, Umapada Pal.

**Figure 1.** Feature completeness across document datasets: OPRB is the only benchmark that jointly provides degradation-aware labels, paired clean/degraded pages, word-level supervision, and restoration-oriented structure, making it more suitable for document restoration than standard layout-only benchmarks.

**Figure 2.** Sample visualization of occluding patches over a docu…

**Figure 3.** t-SNE comparison across document datasets: OPRB occupies both distinct and shared regions in the shared feature space, indicating that it captures degradation patterns and restoration-relevant document conditions not represented by standard clean-layout benchmarks.

**Figure 4.** PCA variance across datasets: OPRB exhibits broad variation in its document appearance and degradation patterns, which is important for evaluating restoration methods under diverse conditions.

**Figure 5.** DocRevive: a unified model architecture for the proposed document restoration pipeline. Given a degraded document image, the framework first performs OCR, followed by occlusion-aware blank region extraction to localize missing text areas. The surrounding word context is then grouped and converted into formatted prompt tokens, which are processed by the RoBERTa masked language model to predict the missing c…

**Figure 6.** Visualization of the proposed framework on a real scanned document degraded using whiteboard marker ink. The document…

**Figure 7.** Visualization of the proposed framework on a few OPRB document images for different types of occlusions. [Zoom in for better…

**Figure 8.** Visualization of the state of the art industry models on a few OPRB document images for different types of occlusions. [Zoom in…

**Figure 9.** Graphical visualization of effectiveness of UCSM over Edit Distance [Zoom in for better visualization]

**Figure 10.** Qualitative comparison of document restoration methods across occlusion types.
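The Figure 5 caption describes grouping the surrounding word context into prompt tokens for a RoBERTa masked language model. A minimal sketch of that prompt-building step follows. It is an illustration only: the function name `build_prompts`, the `window` parameter, and the choice of one mask per occluded word are assumptions, not details taken from the paper. The only fixed fact used is that RoBERTa's mask token is `<mask>`.

```python
from typing import List, Optional

MASK = "<mask>"  # RoBERTa's mask token


def build_prompts(tokens: List[Optional[str]], window: int = 5) -> List[str]:
    """Build one prompt per occluded word.

    `tokens` is a line of OCR output in reading order, with None marking
    a word the occlusion detector flagged as missing. Each prompt keeps up
    to `window` neighbours on each side and replaces every missing word
    inside that span with the mask token.
    """
    prompts = []
    for i, tok in enumerate(tokens):
        if tok is not None:
            continue  # only occluded positions get a prompt
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = (
            [t if t is not None else MASK for t in left]
            + [MASK]
            + [t if t is not None else MASK for t in right]
        )
        prompts.append(" ".join(context))
    return prompts
```

A real gap may cover several subword tokens, so a single mask per word is a simplification; a fuller version would need some estimate of how many tokens the blank region can hold.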

Original abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30{,}078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available at Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive) respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DocRevive assembles existing OCR, inpainting, and diffusion pieces into one pipeline, ships a 30k synthetic dataset and UCSM metric, and releases the code, but all numbers come from that unvalidated synthetic set.

The paper's core move is to chain standard OCR, an occlusion detector, masked language modeling, and diffusion inpainting into a single flow that tries to restore both the text and its visual fit. They also generate and release the OPRB synthetic dataset of 30,078 images plus the UCSM metric that blends edit, semantic, and contextual scores. The GitHub and Hugging Face links are real and public, which is the clearest positive here because it lets others run or extend the work without starting from scratch. That kind of artifact release is useful for a narrow but practical task like archival document cleanup.

The main weakness is exactly the one in the stress-test note: the entire evaluation sits on synthetic degradations whose match to real scans is not shown. No Kolmogorov-Smirnov checks, no side-by-side human ratings against actual damaged pages, and no mention of how the synthetic process handles bleed-through or uneven lighting. Without that, the UCSM numbers and the 'new standard' phrasing stay tied to the benchmark rather than proving broader progress. The abstract also gives no quantitative results or ablations, so it is impossible to tell whether the pipeline actually beats running the components separately.

This is for people already working on document restoration pipelines who want a ready-to-try starting point and the dataset for their own tests. A reader who needs code or a new benchmark to compare against could get something out of it. I would send it to peer review because the releases give referees something concrete to examine, even if the synthetic-data gap needs to be addressed in revision.
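The Kolmogorov-Smirnov check mentioned in the letter can be sketched in a few lines. The example below computes the two-sample KS statistic, the largest gap between two empirical distribution functions, for any scalar feature measured per image. Which feature to compare (for instance an ink-coverage ratio or an OCR confidence score) is an assumption left to the reader; the paper reports no such analysis, and the function name `ks_statistic` is illustrative.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|.

    `a` and `b` are non-empty samples of the same scalar feature, e.g. one
    value per synthetic image and one value per real scan. The empirical
    CDFs are step functions that change only at observed values, so it is
    enough to evaluate the gap at every data point.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)  # share of a that is <= x
        f_b = bisect_right(b_sorted, x) / len(b_sorted)  # share of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d
```

A value near 0 means the two samples are distributed similarly for that feature, and a value near 1 means they barely overlap. Turning the statistic into a significance decision needs the sample sizes as well, which this sketch leaves out.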

Referee Report

2 major / 2 minor

Summary. The paper presents DocRevive, a unified pipeline for restoring damaged or occluded text in documents. It combines state-of-the-art OCR, image analysis for occlusion detection, masked language modeling for semantic coherence, and diffusion-based inpainting to reconstruct text while preserving visual properties such as font, size, and alignment. The work introduces the synthetic OPRB dataset of 30,078 degraded document images to benchmark restoration tasks and proposes the Unified Context Similarity Metric (UCSM), which combines edit, semantic, and length similarities with a contextual predictability penalty. The pipeline and dataset are claimed to set a new standard for document text reconstruction, with code and data released publicly.

Significance. If the central claims hold, the work could advance document restoration for archival and digital preservation applications by providing an integrated pipeline and an open benchmark. The public release of the OPRB dataset and code is a clear strength that enables reproducibility and follow-on research. However, the significance is currently limited by the lack of demonstrated fidelity between the synthetic degradations and real-world document conditions.

major comments (2)

[Abstract / OPRB dataset section] The central claim that the pipeline 'sets a new standard' rests on quantitative results obtained exclusively on the synthetic OPRB dataset (Abstract and dataset description). No validation is reported that the generated degradations match the statistics of real archival documents (e.g., edge histograms, bleed-through distributions, or OCR error rates under non-uniform lighting). Without such evidence the benchmark results remain dataset-specific and do not yet support field-wide conclusions.
[Evaluation / UCSM definition] The UCSM metric is presented as incorporating 'contextual predictability' to penalize deviations when text is obvious from context, yet no ablation or sensitivity analysis is shown for the weighting of its edit, semantic, length, and predictability components. This makes it difficult to assess whether UCSM provides a robust, non-circular evaluation of restoration quality beyond standard metrics.

minor comments (2)

[Abstract] The abstract states the dataset size as '30{,}078' with an unusual comma placement; standardize to 30,078 throughout.
[Pipeline overview] The pipeline diagram and component descriptions would benefit from explicit input/output specifications for each stage (OCR → occlusion detector → MLM → diffusion) to clarify data flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions.

Referee: [Abstract / OPRB dataset section] The central claim that the pipeline 'sets a new standard' rests on quantitative results obtained exclusively on the synthetic OPRB dataset (Abstract and dataset description). No validation is reported that the generated degradations match the statistics of real archival documents (e.g., edge histograms, bleed-through distributions, or OCR error rates under non-uniform lighting). Without such evidence the benchmark results remain dataset-specific and do not yet support field-wide conclusions.

Authors: We acknowledge that the manuscript reports no quantitative statistical validation (such as edge histograms or bleed-through distributions) comparing the synthetic degradations to real archival documents. The OPRB dataset was designed to simulate representative degradation types drawn from the document analysis literature, but this design rationale is not accompanied by direct empirical matching in the current text. In revision we will add a subsection to the dataset description that details the degradation generation process, cites supporting references for the chosen degradation models, and explicitly notes the absence of real-world statistical validation as a limitation. We will also revise the abstract and conclusion to state that the results establish performance on a controlled synthetic benchmark rather than claiming a field-wide standard.

revision: partial

Referee: [Evaluation / UCSM definition] The UCSM metric is presented as incorporating 'contextual predictability' to penalize deviations when text is obvious from context, yet no ablation or sensitivity analysis is shown for the weighting of its edit, semantic, length, and predictability components. This makes it difficult to assess whether UCSM provides a robust, non-circular evaluation of restoration quality beyond standard metrics.

Authors: The observation is correct: the manuscript defines UCSM as a linear combination of edit, semantic, length, and contextual predictability terms but provides no ablation or sensitivity study on the relative weights. In the revised manuscript we will add an ablation subsection to the evaluation that systematically varies each component weight, reports the resulting metric values on the OPRB test set, and compares against standard metrics (CER, semantic similarity) to demonstrate that the combined score is not circular and offers additional diagnostic value.

revision: yes
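To make the weighting question concrete, here is a minimal sketch of a score built as a linear combination of edit, semantic, length, and contextual terms. It is not the paper's UCSM: the exact formula and weights are not given on this page, so the equal default weights, the form of the contextual term, and all function names are assumptions. The semantic similarity and the contextual predictability are passed in as numbers in [0, 1], since computing them would need an embedding model and a language model.

```python
from typing import Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest


def length_similarity(pred: str, ref: str) -> float:
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else min(len(pred), len(ref)) / longest


def composite_score(
    pred: str,
    ref: str,
    semantic: float,        # assumed precomputed, in [0, 1]
    predictability: float,  # assumed precomputed, in [0, 1]
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Hypothetical composite; the weights are illustrative, not the paper's."""
    e = edit_similarity(pred, ref)
    ln = length_similarity(pred, ref)
    # Assumed contextual term: the more predictable the correct text is,
    # the more a deviation from it (low edit similarity) is penalized.
    c = 1.0 - predictability * (1.0 - e)
    w_e, w_s, w_l, w_c = weights
    return w_e * e + w_s * semantic + w_l * ln + w_c * c
```

A sensitivity study of the kind the referee asks for would amount to re-running `composite_score` over a grid of `weights` and checking how much the ranking of restoration methods changes.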

Circularity Check

0 steps flagged

No circularity in pipeline or metric proposal

The manuscript describes an engineering pipeline (OCR + occlusion detection + MLM + diffusion inpainting) and introduces UCSM as a composite metric without any equations, fitted parameters, or predictions that reduce to their own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and the synthetic OPRB dataset is presented as a created benchmark rather than a self-referential evaluation. The central claims rest on external model components and a new metric definition that does not tautologically reproduce its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that synthetic degradations are representative and that the staged pipeline produces semantically coherent and visually consistent output; no free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Synthetic data of 30,078 images can serve as a valid benchmark for real document restoration tasks
The paper states it creates this dataset to set a benchmark for restoration tasks.

pith-pipeline@v0.9.0 · 5776 in / 1276 out tokens · 23453 ms · 2026-05-25T06:36:08.310929+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 5 internal anchors

[1] Yannis Assael, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. Restoring and attributing ancient texts using deep neural networks. Nature, 603(7900):280–283, 2022.
[2] Ayan Banerjee, Josep Lladós, Umapada Pal, and Anjan Dutta. TaleDiffusion: Multi-character story generation with dialogue rendering. arXiv preprint arXiv:2509.04123, 2025.
[3] Ayan Banerjee, Fernando Vilariño, and Josep Lladós. CraftGraffiti: Exploring human identity with custom graffiti art via facial-preserving diffusion models. arXiv preprint arXiv:2508.20640, 2025.
[4] Ayan Banerjee, Nityanand Mathur, Josep Llados, Umapada Pal, and Anjan Dutta. Craftsvg: Multi-object text-to-svg synthesis via layout guided diffusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2564–2574, 2026.
[5] Darwin Bautista and Rowel Atienza. Scene text recognition with permuted autoregressive sequence models. In European conference on computer vision, pages 178–196. Springer,
[6] Jens Bjerring-Hansen, Ross Deans Kristensen-McLachlan, Philip Diderichsen, and Dorte Haltrup Hansen. Mending fractured texts. a heuristic procedure for correcting ocr data. In CEUR Workshop Proceedings, pages 177–186. ceur workshop proceedings, 2022.
[7] Mike Cannon, Mike Fugate, Don R Hush, and Clint Scovel. Selecting a restoration technique to minimize ocr error. IEEE Transactions on Neural Networks, 14(3):478–490, 2003.
[8] Abhishek Chaurasia and Eugenio Culurciello. Linknet: Exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE visual communications and image processing (VCIP), pages 1–4. IEEE, 2017.
[9] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. In Advances in Neural Information Processing Systems (NeurIPS) 36, 2023.
[10] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European conference on computer vision, pages 17–33. Springer, 2022.
[11] Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, and Tong Lu. Fast: Faster arbitrarily-shaped text detector with minimalist kernel representation. arXiv preprint arXiv:2111.02394, 2021.
[12] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020.
[13] Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, and Yu-Gang Jiang. Context perception parallel decoder for scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
[14] Ethan Fetaya, Yonatan Lifshitz, Elad Aaron, and Shai Gordin. Restoration of fragmentary babylonian texts using recurrent neural networks. Proceedings of the National Academy of Sciences (PNAS), 117(37):22743–22751, 2020.
[15] Mathieu François and Véronique Eglin. Unsupervised post-ocr correction for noisy text in engineering documents. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), 2023.
[16] Shuhao Guan and Derek Greene. Advancing post-ocr correction: A comparative study of synthetic data. arXiv preprint arXiv:2408.02253, 2024.
[17] Tongkun Guan, Chaochen Gu, Jingzheng Tu, Xue Yang, Qi Feng, Yudi Zhao, and Wei Shen. Self-supervised implicit glyph attention for text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15285–15294, 2023.
[18] Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.
[19] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 949–960, 2020.
[20] Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based optical character recognition with pre-trained models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13094–13102,
[21] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI conference on artificial intelligence, pages 11474–11481, 2020.
[22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[23] Ayush Mittal, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein, and Daniel Lopresti. A new context-based method for restoring occluded text in natural scene images. In Document Analysis Systems: 14th IAPR International Workshop, DAS 2020, Wuhan, China, July 26–29, 2020, Proceedings 14, pages 466–480. Springer, 2020.
[24] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029–1058, 1992.
[25] Premkumar Natarajan, Issam Bazzi, Zhidong Lu, John Makhoul, and Richard Schwartz. Robust ocr of degraded documents. In Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR’99 (Cat. No. PR00318), pages 357–361. IEEE, 1999.
[26] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,
[28] Maria Pilligua, Nil Biescas, Javier Vazquez-Corral, Josep Lladós, Ernest Valveny, and Sanket Biswas. Layereddoc: Domain adaptive document restoration with a layer separation approach. In International Conference on Document Analysis and Recognition, pages 27–39. Springer, 2024.
[29] Kunal Purkayastha, Shashwat Sarkar, Shivakumara Palaiahnakote, Umapada Pal, and Palash Ghosal. Datr: Domain agnostic text recognizer. In International Conference on Pattern Recognition, pages 220–235. Springer, 2025.
[30] R Sapkota, RH Cheppally, A Sharda, and M Karkee. Yolo26: Key architectural enhancements and performance benchmarking for real-time object detection. arXiv preprint arXiv:2509.25164, 2025.
[31] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.
[32] Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seiichi Uchida, and Kota Yamaguchi. Type-r: Automatically retouching typos for text-to-image generation. arXiv preprint arXiv:2411.18159, 2024.
[33] Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. De-GAN: a conditional generative adversarial network for document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1180–1191, 2020.
[34] Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. Docentr: An end-to-end document image enhancement transformer. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 1699–1705. IEEE, 2022.
[35] Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gómez, and Dimosthenis Karatzas. Text-DIAE: A self-supervised degradation invariant autoencoder for text recognition and document enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[36] B. Su, S. Lu, and C. L. Tan. Robust document image binarization technique for degraded document images. IEEE Transactions on Image Processing, 22(4):1408–1417, 2013.
[37] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19254–19264, 2023.
[38] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[39] Alan Thomas, Robert Gaizauskas, and Haiping Lu. Leveraging LLMs for post-ocr correction of historical newspapers. In Proceedings of the LT4HALA Workshop at LREC-COLING, pages 116–121, 2024.
[40] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458,
[41] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024.
[42] Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, and Yongdong Zhang. Symmetrical linguistic feature distillation with clip for scene text recognition. In Proceedings of the 31st ACM international conference on multimedia, pages 509–518, 2023.
[43] Jianjun Xu, Yuxin Wang, Hongtao Xie, and Yongdong Zhang. Ote: exploring accurate scene text recognition using one token. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28327–28336, 2024.
[44] Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. DocDiff: Document enhancement via residual diffusion models. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), pages 2795–2806,
[45] Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. Docdiff: Document enhancement via residual diffusion models. In Proceedings of the 31st ACM international conference on multimedia, pages 2795–2806, 2023.
[46] Muhammad Yaseen. What is yolov8: an in-depth exploration of the internal features of the next-generation object detector (2024). Accessed: Sep, 10, 2025.
[47] Fangchen Yu, Yina Xie, Lei Wu, Yafei Wen, Guozhi Wang, Shuai Ren, Xiaoxin Chen, Jianfeng Mao, and Wenye Li. DocReal: Robust document dewarping of real-life images via attention-enhanced control point prediction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 665–674, 2024.
[48] Li Yujian and Liu Bo. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007.
[49] Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, and Yu Zhou. Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2025.
[50] Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, and Yongdong Zhang. Linguistic more: Taking a further step toward efficient and accurate scene text recognition. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), pages 1704–1712, 2023.
[51] Boqiang Zhang, Hongtao Xie, Zuan Gao, and Yuxin Wang. Choose what you need: Disentangled representation learning for scene text recognition removal and editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28358–28368, 2024.
[52] Jiaxin Zhang, Dezhi Peng, Chongyu Liu, Peirong Zhang, and Lianwen Jin. DocRes: A generalist model toward unifying document image restoration tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[53] Ling Zhang, Yinxiao He, Qing Zhang, Zheng Liu, Xiaolong Zhang, and Chunxia Xiao. Document image shadow removal guided by color-aware background. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1818–1827, 2023.

work page 2023
[54]

A review of document image en- hancement based on document degradation problem.Ap- plied Sciences, 13(13):7855, 2023

Yanxi Zhou, Shikai Zuo, Zhengxian Yang, Jinlong He, Jian- wen Shi, and Rui Zhang. A review of document image en- hancement based on document degradation problem.Ap- plied Sciences, 13(13):7855, 2023. 1

work page 2023
[55]

Text image inpainting via global structure-guided diffusion models

Shipeng Zhu, Pengfei Fang, Chenjie Zhu, Zuoyan Zhao, Qiang Xu, and Hui Xue. Text image inpainting via global structure-guided diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 7775– 7783, 2024. 3, 5 DocRevive: A Unified Pipeline for Document Text Restoration Supplementary Material

work page 2024
[56]

In the current generator, we chooseNunique source pages per class-level

Dataset Construction Details This supplementary section provides the full construc- tion details of the Occluded Pages Restoration Benchmark (OPRB). In the current generator, we chooseNunique source pages per class-level. We introduce a novel benchmark dataset called Occluded Pages Restoration Benchmark (OPRB) designed to eval- uate document restoration u...

work page
[57]

We evaluate the on three benchmark datasets

Method Details 10.1. Occlusion Detection and Blank Region Ex- traction Occlusion patches are first localized using a fine-tuned YOLOv9c detector [41] trained on the OPRB dataset. The benchmark contains six degradation classes,Black Ink, Burnt,Whitener,Dust,Scribble, andStamp. Opaque classes (Black Ink,Burnt,Whitener) fully obscure the un- derlying text, t...

work page
[58]

We evaluate the on three benchmark datasets

Misceleneous Experiments 11.1. Comparison with Prior Document Restora- tion Methods We compare DocRevive against three prior methods on a subset of 498 images from OPRB (83 per occlusion type) DocDiff [45], GSDM (standalone), our pipeline’s inpaint- ing module run in isolation without any text prediction or editing and NAFNet [10], a strong general image ...

work page arXiv

[1] Yannis Assael, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. Restoring and attributing ancient texts using deep neural networks. Nature, 603(7900):280–283, 2022.

[2] Ayan Banerjee, Josep Lladós, Umapada Pal, and Anjan Dutta. TaleDiffusion: Multi-character story generation with dialogue rendering. arXiv preprint arXiv:2509.04123, 2025.

[3] Ayan Banerjee, Fernando Vilariño, and Josep Lladós. CraftGraffiti: Exploring human identity with custom graffiti art via facial-preserving diffusion models. arXiv preprint arXiv:2508.20640, 2025.

[4] Ayan Banerjee, Nityanand Mathur, Josep Llados, Umapada Pal, and Anjan Dutta. Craftsvg: Multi-object text-to-svg synthesis via layout guided diffusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2564–2574, 2026.

[5] Darwin Bautista and Rowel Atienza. Scene text recognition with permuted autoregressive sequence models. In European conference on computer vision, pages 178–196. Springer,

[6] Jens Bjerring-Hansen, Ross Deans Kristensen-McLachlan, Philip Diderichsen, and Dorte Haltrup Hansen. Mending fractured texts. A heuristic procedure for correcting ocr data. In CEUR Workshop Proceedings, pages 177–186. CEUR Workshop Proceedings, 2022.

[7] Mike Cannon, Mike Fugate, Don R Hush, and Clint Scovel. Selecting a restoration technique to minimize ocr error. IEEE Transactions on Neural Networks, 14(3):478–490, 2003.

[8] Abhishek Chaurasia and Eugenio Culurciello. Linknet: Exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE visual communications and image processing (VCIP), pages 1–4. IEEE, 2017.

[9] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. In Advances in Neural Information Processing Systems (NeurIPS) 36, 2023.

[10] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European conference on computer vision, pages 17–33. Springer, 2022.

[11] Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, and Tong Lu. Fast: Faster arbitrarily-shaped text detector with minimalist kernel representation. arXiv preprint arXiv:2111.02394, 2021.

[12] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020.

[13] Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, and Yu-Gang Jiang. Context perception parallel decoder for scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,

[14] Ethan Fetaya, Yonatan Lifshitz, Elad Aaron, and Shai Gordin. Restoration of fragmentary babylonian texts using recurrent neural networks. Proceedings of the National Academy of Sciences (PNAS), 117(37):22743–22751, 2020.

[15] Mathieu François and Véronique Eglin. Unsupervised post-ocr correction for noisy text in engineering documents. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), 2023.

[16] Shuhao Guan and Derek Greene. Advancing post-ocr correction: A comparative study of synthetic data. arXiv preprint arXiv:2408.02253, 2024.

[17] Tongkun Guan, Chaochen Gu, Jingzheng Tu, Xue Yang, Qi Feng, Yudi Zhao, and Wei Shen. Self-supervised implicit glyph attention for text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15285–15294, 2023.

[18] Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.

[19] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 949–960, 2020.

[20] Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based optical character recognition with pre-trained models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13094–13102,

[21] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI conference on artificial intelligence, pages 11474–11481, 2020.

[22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[23] Ayush Mittal, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein, and Daniel Lopresti. A new context-based method for restoring occluded text in natural scene images. In Document Analysis Systems: 14th IAPR International Workshop, DAS 2020, Wuhan, China, July 26–29, 2020, Proceedings 14, pages 466–480. Springer, 2020.

[24] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029–1058, 1992.

[25] Premkumar Natarajan, Issam Bazzi, Zhidong Lu, John Makhoul, and Richard Schwartz. Robust ocr of degraded documents. In Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR’99 (Cat. No. PR00318), pages 357–361. IEEE, 1999.

[26] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.

[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

[28] Maria Pilligua, Nil Biescas, Javier Vazquez-Corral, Josep Lladós, Ernest Valveny, and Sanket Biswas. Layereddoc: Domain adaptive document restoration with a layer separation approach. In International Conference on Document Analysis and Recognition, pages 27–39. Springer, 2024.

[29] Kunal Purkayastha, Shashwat Sarkar, Shivakumara Palaiahnakote, Umapada Pal, and Palash Ghosal. Datr: Domain agnostic text recognizer. In International Conference on Pattern Recognition, pages 220–235. Springer, 2025.

[30] R Sapkota, RH Cheppally, A Sharda, and M Karkee. Yolo26: Key architectural enhancements and performance benchmarking for real-time object detection. arXiv preprint arXiv:2509.25164, 2025.

[31] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.

[32] Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seiichi Uchida, and Kota Yamaguchi. Type-r: Automatically retouching typos for text-to-image generation. arXiv preprint arXiv:2411.18159, 2024.

[33] Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. De-GAN: a conditional generative adversarial network for document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1180–1191, 2020.

[34] Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. Docentr: An end-to-end document image enhancement transformer. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 1699–1705. IEEE, 2022.

[35] Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gómez, and Dimosthenis Karatzas. Text-DIAE: A self-supervised degradation invariant autoencoder for text recognition and document enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

[36] B. Su, S. Lu, and C. L. Tan. Robust document image binarization technique for degraded document images. IEEE Transactions on Image Processing, 22(4):1408–1417, 2013.

[37] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19254–19264, 2023.

[38] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[39] Alan Thomas, Robert Gaizauskas, and Haiping Lu. Leveraging LLMs for post-ocr correction of historical newspapers. In Proceedings of the LT4HALA Workshop at LREC-COLING, pages 116–121, 2024.

[40] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458,

[41] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024.

[42] Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, and Yongdong Zhang. Symmetrical linguistic feature distillation with clip for scene text recognition. In Proceedings of the 31st ACM international conference on multimedia, pages 509–518, 2023.

[43] Jianjun Xu, Yuxin Wang, Hongtao Xie, and Yongdong Zhang. Ote: exploring accurate scene text recognition using one token. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28327–28336, 2024.

[44] Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. DocDiff: Document enhancement via residual diffusion models. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), pages 2795–2806,

[45] Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, and Xing Zhang. Docdiff: Document enhancement via residual diffusion models. In Proceedings of the 31st ACM international conference on multimedia, pages 2795–2806, 2023.

[46] Muhammad Yaseen. What is yolov8: an in-depth exploration of the internal features of the next-generation object detector (2024). Accessed: Sep 10, 2025.

[47] Fangchen Yu, Yina Xie, Lei Wu, Yafei Wen, Guozhi Wang, Shuai Ren, Xiaoxin Chen, Jianfeng Mao, and Wenye Li. DocReal: Robust document dewarping of real-life images via attention-enhanced control point prediction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 665–674, 2024.

[48] Li Yujian and Liu Bo. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007.

[49] Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, and Yu Zhou. Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2025.

[50] Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, and Yongdong Zhang. Linguistic more: Taking a further step toward efficient and accurate scene text recognition. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), pages 1704–1712, 2023.

[51] Boqiang Zhang, Hongtao Xie, Zuan Gao, and Yuxin Wang. Choose what you need: Disentangled representation learning for scene text recognition removal and editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28358–28368, 2024.

[52] Jiaxin Zhang, Dezhi Peng, Chongyu Liu, Peirong Zhang, and Lianwen Jin. DocRes: A generalist model toward unifying document image restoration tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[53] Ling Zhang, Yinxiao He, Qing Zhang, Zheng Liu, Xiaolong Zhang, and Chunxia Xiao. Document image shadow removal guided by color-aware background. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1818–1827, 2023.

[54] Yanxi Zhou, Shikai Zuo, Zhengxian Yang, Jinlong He, Jianwen Shi, and Rui Zhang. A review of document image enhancement based on document degradation problem. Applied Sciences, 13(13):7855, 2023.

[55] Shipeng Zhu, Pengfei Fang, Chenjie Zhu, Zuoyan Zhao, Qiang Xu, and Hui Xue. Text image inpainting via global structure-guided diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7775–7783, 2024.

DocRevive: A Unified Pipeline for Document Text Restoration
Supplementary Material

Dataset Construction Details

This supplementary section provides the full construction details of the Occluded Pages Restoration Benchmark (OPRB). In the current generator, we choose N unique source pages per class-level. We introduce a novel benchmark dataset called Occluded Pages Restoration Benchmark (OPRB) designed to evaluate document restoration u...

Method Details

10.1. Occlusion Detection and Blank Region Extraction

Occlusion patches are first localized using a fine-tuned YOLOv9c detector [41] trained on the OPRB dataset. The benchmark contains six degradation classes: Black Ink, Burnt, Whitener, Dust, Scribble, and Stamp. Opaque classes (Black Ink, Burnt, Whitener) fully obscure the underlying text, t...
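As an illustration of the class grouping described above, the following minimal sketch separates detector outputs into the three opaque classes and the remaining three. It is not the authors' implementation: the `Detection` container, the `split_by_opacity` helper, and the `min_score` threshold are hypothetical names and values introduced only for this example, and the detector itself is not invoked.

```python
from dataclasses import dataclass

# The six OPRB degradation classes named in the text.
OPRB_CLASSES = ("Black Ink", "Burnt", "Whitener", "Dust", "Scribble", "Stamp")
# The text states that these three classes fully obscure the underlying text.
OPAQUE_CLASSES = frozenset({"Black Ink", "Burnt", "Whitener"})


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one detector output (not the authors' format)."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels
    score: float


def split_by_opacity(detections, min_score=0.5):
    """Split detections into (opaque, other), dropping low-confidence boxes.

    min_score is an illustrative threshold, not a value from the paper.
    """
    opaque, other = [], []
    for det in detections:
        if det.label not in OPRB_CLASSES:
            raise ValueError(f"unknown OPRB class: {det.label!r}")
        if det.score < min_score:
            continue  # skip boxes below the assumed confidence threshold
        (opaque if det.label in OPAQUE_CLASSES else other).append(det)
    return opaque, other


detections = [
    Detection("Burnt", (10, 20, 110, 60), 0.91),
    Detection("Stamp", (200, 40, 320, 150), 0.84),
    Detection("Dust", (5, 5, 30, 30), 0.32),
]
opaque, other = split_by_opacity(detections)
print([d.label for d in opaque], [d.label for d in other])
```

How each group is processed downstream is not recoverable from the truncated passage, so the sketch stops at the split.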

Miscellaneous Experiments

11.1. Comparison with Prior Document Restoration Methods

We compare DocRevive against three prior methods on a subset of 498 images from OPRB (83 per occlusion type): DocDiff [45]; GSDM (standalone), our pipeline’s inpainting module run in isolation without any text prediction or editing; and NAFNet [10], a strong general image ...