PILOT: A Promptable Interleaved Layout-aware OCR Transformer

Amine Tamasna; Laziz Hamdi; Pascal Boisson; Thierry Paquet

arXiv: 2504.03621 · v2 · submitted 2025-04-04 · cs.CV

Pith reviewed 2026-05-22 20:53 UTC · model grok-4.3

classification: cs.CV

keywords: OCR, document layout analysis, transformer decoder, promptable generation, handwritten text recognition, spatial grounding, end-to-end sequence model

The pith

A single compact model jointly performs text recognition and spatial grounding on handwritten and printed documents via unified sequence generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether document OCR can be reformulated as autoregressive generation of a single interleaved stream of text subwords and quantized coordinate tokens, eliminating separate detection and segmentation stages. A 155M-parameter prompt-conditioned model, pairing a CNN page encoder with a Transformer decoder, supports full-page transcription, region-conditioned extraction, and query-by-string spotting within one architecture. Training proceeds through a three-stage curriculum that first teaches plain transcription, then joint text-and-box output, and finally prompt control. Experiments across IAM, RIMES 2009, SROIE 2019, and MAURDOR show competitive recognition and line-level detection accuracy against traditional pipelines and larger models while remaining substantially smaller.

Core claim

PILOT formulates OCR as unified sequence generation in which a lightweight depthwise-separable CNN encodes the input page and a Transformer decoder autoregressively emits subword tokens interleaved with quantized absolute-coordinate tokens on a fixed 10 px grid; this single stream supports full-page OCR, region-conditioned reading, and query-by-string spotting when the model is conditioned on appropriate prompts.
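
A minimal sketch of how such an interleaved target could be assembled, assuming round-to-nearest quantization, a box-before-text ordering, and a hypothetical `<loc_k>` token spelling. None of these details are given in the text above, and whitespace splitting stands in for the real subword tokenizer.

```python
GRID_PX = 10  # grid step stated in the paper


def quantize(value_px, grid_px=GRID_PX):
    """Map a pixel coordinate to the index of the nearest grid line."""
    return int((value_px + grid_px / 2) // grid_px)


def box_tokens(box_px):
    """Turn (x1, y1, x2, y2) in pixels into four hypothetical <loc_k> tokens."""
    return [f"<loc_{quantize(v)}>" for v in box_px]


def interleave(lines):
    """Build one stream: for each line, box tokens followed by text tokens."""
    stream = []
    for text, box_px in lines:
        stream.extend(box_tokens(box_px))
        stream.extend(text.split())  # stand-in for subword tokenization
    return stream


# Hypothetical page with two text lines and pixel-space boxes.
page = [("Dear Sir", (312, 148, 905, 203)), ("Thank you", (310, 221, 640, 270))]
tokens = interleave(page)
print(" ".join(tokens))
```

With a 10 px step, the box (312, 148, 905, 203) becomes the four tokens `<loc_31> <loc_15> <loc_91> <loc_20>`, so under these assumptions each line costs four extra tokens regardless of its length.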

What carries the argument

Unified autoregressive generation of interleaved subword and quantized absolute-coordinate tokens on a 10 px grid, conditioned on prompts.
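
Reading such a stream back into lines and boxes needs only a small parser. The sketch below assumes a hypothetical `<loc_k>` token spelling, four coordinate tokens per line, and a box-before-text ordering; the paper's actual output format may differ.

```python
import re

GRID_PX = 10  # grid step stated in the paper
LOC = re.compile(r"<loc_(\d+)>")  # hypothetical coordinate-token spelling


def decode(stream):
    """Split a generated token stream into (text, box_px) pairs.

    Assumes each line is emitted as coordinate tokens followed by its text.
    """
    lines, coords, words = [], [], []
    for tok in stream:
        match = LOC.fullmatch(tok)
        if match:
            if words:  # a coordinate token after text starts a new line
                lines.append((" ".join(words), tuple(coords)))
                coords, words = [], []
            coords.append(int(match.group(1)) * GRID_PX)
        else:
            words.append(tok)
    if words:
        lines.append((" ".join(words), tuple(coords)))
    return lines


stream = ["<loc_31>", "<loc_15>", "<loc_91>", "<loc_20>", "Dear", "Sir",
          "<loc_31>", "<loc_22>", "<loc_64>", "<loc_27>", "Thank", "you"]
decoded = decode(stream)
print(decoded)
```

Because text and geometry arrive in one sequence, no separate detector output has to be aligned with the transcription; the recovered boxes are on the grid, not at the original pixel positions.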

If this is right

Full-page OCR, region-conditioned reading, and query-by-string spotting become possible inside one 155M-parameter model without task-specific heads.
A three-stage curriculum progressing from transcription to joint text-and-box generation to prompt control stabilizes training of the interleaved output.
Competitive recognition and detection accuracy is achieved on IAM, RIMES 2009, SROIE 2019, and heterogeneous MAURDOR while using far fewer parameters than billion-scale multimodal models.
Releasing the synthetic SROIE generator, 500k IDL/PDFA pages, and harmonized line-level annotations enables direct reproduction and further prompt-based experiments.
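
The three-stage curriculum mentioned in the list above can be pictured as three ways of turning the same annotated page into a (prompt, target) pair. This is only an illustrative sketch: the `<loc_k>` spelling, the newline separator, and the function name `make_example` are hypothetical, and the "Read at ..." prompt wording is modeled on the example prompts quoted in the figure list of this page.

```python
def make_example(stage, lines, region_index=0):
    """Return (prompt, target) for one page under a hypothetical 3-stage curriculum.

    lines: list of (text, (x1, y1, x2, y2)) with coordinates already in grid units.
    """
    def loc(box):
        return " ".join(f"<loc_{v}>" for v in box)

    if stage == 1:  # plain transcription: text only, no prompt
        return "", "\n".join(text for text, _ in lines)
    if stage == 2:  # joint text-and-box output for the full page
        return "", "\n".join(f"{loc(box)} {text}" for text, box in lines)
    if stage == 3:  # prompt control: read only the requested region
        text, box = lines[region_index]
        prompt = "Read at " + ", ".join(str(v) for v in box)
        return prompt, text
    raise ValueError(f"unknown stage: {stage}")


# Hypothetical page annotation in grid units.
page = [("Dear Sir", (31, 15, 91, 20)), ("Thank you", (31, 22, 64, 27))]
for stage in (1, 2, 3):
    print(stage, make_example(stage, page, region_index=1))
```

The point of the sketch is that later stages reuse what earlier stages taught: stage 2 adds geometry to the stage 1 transcription, and stage 3 restricts the same output to what the prompt asks for.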

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fixed 10 px grid may limit precision on very small text or curved layouts, suggesting a possible extension to adaptive or finer quantization.
Because the same decoder handles both printed and handwritten input, the approach could be tested on mixed-language or degraded historical documents without retraining separate recognizers.
Prompt conditioning opens the door to interactive document agents that request only specific fields or answer layout-aware questions directly from the generated token stream.
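
On the first point in the list above, the cost of a finer grid can be estimated with simple arithmetic. The sketch assumes one token per coordinate value, a vocabulary shared by both axes, round-to-nearest quantization, and an illustrative longest page side of 3508 px; none of these is confirmed as the paper's setup.

```python
def coord_vocab_size(max_side_px, grid_px):
    """Number of distinct coordinate tokens needed to index one axis."""
    return max_side_px // grid_px + 1


def worst_case_error_px(grid_px):
    """Largest per-coordinate rounding error under round-to-nearest."""
    return grid_px / 2


MAX_SIDE_PX = 3508  # illustrative longest page side, an assumption
for grid in (10, 5, 2, 1):
    print(f"grid={grid:>2} px  tokens={coord_vocab_size(MAX_SIDE_PX, grid):>4}  "
          f"max error={worst_case_error_px(grid)} px")
```

Under these assumptions, halving the grid step halves the worst-case error but roughly doubles the number of coordinate tokens the decoder must learn to embed and predict.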

Load-bearing premise

Autoregressive emission of interleaved text and fixed-grid coordinate tokens is sufficient to produce accurate spatial layout without separate detection stages or post-processing.

What would settle it

Measure whether line-level detection F1 on SROIE or MAURDOR drops below that of a standard two-stage OCR pipeline when the model is required to output both text and boxes from the same generated sequence.
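
One way to run that comparison is to score both systems with the same box-matching F1. The sketch below uses greedy one-to-one matching at an IoU threshold of 0.5, which is a common convention and an assumption here; the paper's own detection protocol may differ.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred, gold, thr=0.5):
    """Greedy one-to-one matching at an IoU threshold, then F1."""
    unmatched = list(gold)
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)  # each gold box can be matched once
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical boxes: one exact hit and one miss.
gold = [(0, 0, 100, 20), (0, 30, 100, 50)]
pred = [(0, 0, 100, 20), (0, 60, 100, 80)]
print(detection_f1(pred, gold))
```

Feeding the boxes decoded from the generated sequence and the boxes from a two-stage pipeline through the same function keeps the comparison independent of how each system produces its output.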

Figures

Figures reproduced from arXiv: 2504.03621 by Amine Tamasna, Laziz Hamdi, Pascal Boisson, Thierry Paquet.

**Figure 2.** Overall architecture consists of a CNN vision encoder and a Transformer …

**Figure 3.** Samples from real datasets enriched with text-line-level annotations.

**Figure 4.** Synthetic samples.

**Figure 5.** t-SNE two-dimensional representations of location token embeddings. Each point corresponds to a token that represents a quantized coordinate in the document. Notably, tokens that encode nearby positions in the document space cluster together along a curve, following each other with a similar gap, reflecting a learned geometric ordering. This indicates that the model’s embedding layer captures spatial continuity: tokens for higher or lower coordinates on an ax…

**Figure 6.** Examples of region-based OCR predictions (errors are highlighted in …
Prompt: Read at 66, 184, 97, 186. Label: 'e-mail:burgess@world.std.com.' Prediction: 'usa'
Prompt: Read at 31, 139, 71, 145. Label: 'exams moved to a' Prediction: 'evans moved to a'

**Figure 7.** Examples of content-based text localization; on the left the label loca…

Original abstract

Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PILOT unifies OCR and layout in one promptable decoder on a 10 px grid, with competitive results and good releases, but the quantization choice needs closer scrutiny for precision.

The central thing to know is that PILOT is a 155M-parameter model that unifies text recognition and layout prediction for both handwritten and printed documents in one promptable generative decoder. It encodes the page with a depthwise-separable CNN and then autoregressively generates a sequence mixing subword tokens and absolute coordinates quantized to a 10 px grid. This lets the same model do full-page OCR, region-specific reading, or query-by-string spotting. The three-stage curriculum helps the model learn the joint task step by step. On IAM, RIMES, SROIE, and MAURDOR it matches or beats traditional pipelines and other end-to-end models while staying much smaller than billion-parameter alternatives. The open release of code, synthetic data tools, and harmonized annotations stands out as useful for the community.

The potential weakness is in the spatial grounding. The fixed 10 px grid caps the achievable localization precision, which could matter for small text or dense areas, and the work does not test whether a finer grid or continuous prediction would improve results. Without those ablations or error bars on the metrics, it is hard to tell how much the discretization affects the reported line-level detection scores. The curriculum is described, but its specific contribution to spatial performance is not broken out in detail.

Readers working on document analysis or end-to-end vision-language models for OCR would get the most from this. It offers a practical alternative to multi-stage systems for those who value compactness and promptability. The paper engages honestly with the literature on HTR and OCR and presents a coherent architecture with supporting experiments. It is solid enough to go through peer review, where the main questions would be around the quantization choice and the strength of the spatial results. I recommend accepting it for review.

Referee Report

3 major / 2 minor

Summary. The manuscript presents PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified autoregressive sequence generation. A lightweight depthwise-separable CNN encodes the input page while a Transformer decoder emits interleaved subword tokens and quantized absolute-coordinate tokens on a fixed 10 px grid. Training proceeds via a three-stage curriculum (plain transcription, joint text-and-box generation, prompt-controlled extraction). Experiments report competitive or superior text recognition and line-level detection results on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark relative to traditional OCR pipelines, end-to-end HTR models, and compact vision-language models, with additional evaluations on fine-grained OCR and query-by-string spotting. Code, synthetic generators, and harmonized annotations are released.

Significance. If the empirical claims hold after addressing the noted gaps, the work would demonstrate that a single compact unified decoder can jointly handle recognition and spatial grounding for both handwritten and printed documents without separate detection or post-processing stages. This has potential to simplify interactive OCR pipelines. The public release of the synthetic SROIE generator, 500k annotated pages, harmonized line-level annotations, and source code strengthens reproducibility and downstream utility.

major comments (3)

[Abstract] Abstract (unified sequence generation paragraph): The central claim that autoregressive interleaving of subword and quantized absolute-coordinate tokens on a fixed 10 px grid suffices for accurate joint recognition and layout without separate detection stages lacks any ablation of grid resolution versus finer quantization or continuous regression. This discretization introduces a rounding error of up to ±5 px per coordinate that is independent of model capacity and may compound in dense or sub-20 px text layouts, directly bearing on whether the reported line-level detection metrics demonstrate sufficiency.
[Experiments] Experiments section (benchmark results): Competitive performance is reported on IAM, RIMES 2009, SROIE 2019, and MAURDOR without error bars, standard deviations across multiple runs, or explicit discussion of how post-hoc curriculum choices were validated. This absence makes it difficult to assess whether the gains over baselines are robust or sensitive to random seeds and training details.
[Method] Method (three-stage curriculum description): The curriculum is stated to stabilize training and improve spatial grounding, yet no ablation isolates the contribution of each stage or the prompt conditioning to the final recognition and detection metrics. Without these controls, it remains unclear whether the reported results depend on this specific procedure or would hold under simpler training regimes.
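
The concern in the first major comment can be made concrete by snapping a ground-truth box to the grid and measuring how much overlap is lost before any model error is added. This is a sketch assuming round-to-nearest quantization; the two boxes are hypothetical examples, not data from the paper.

```python
GRID_PX = 10  # grid step stated in the paper


def snap(value_px, grid_px=GRID_PX):
    """Round a pixel coordinate to the nearest grid line."""
    return int((value_px + grid_px / 2) // grid_px) * grid_px


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_after_snapping(box):
    """IoU between a box and its grid-snapped version (a perfect model's best case)."""
    return iou(box, tuple(snap(v) for v in box))


small_line = (104, 204, 396, 222)  # hypothetical 18 px tall line
tall_line = (104, 204, 396, 282)   # hypothetical 78 px tall region
print(round(iou_after_snapping(small_line), 3))
print(round(iou_after_snapping(tall_line), 3))
```

In this example a perfectly predicted 18 px line scores an IoU of roughly 0.71 after snapping, while the 78 px region stays above 0.90, which illustrates why the effect of the grid is concentrated on small text.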

minor comments (2)

[Abstract] The model-size comparison to billion-scale VLMs would be clearer if presented in a dedicated table rather than only in the abstract text.
[Method] Notation for the quantized coordinate tokens could include an explicit example sequence in the method figure to aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting the original experiments.

Referee: [Abstract] Abstract (unified sequence generation paragraph): The central claim that autoregressive interleaving of subword and quantized absolute-coordinate tokens on a fixed 10 px grid suffices for accurate joint recognition and layout without separate detection stages lacks any ablation of grid resolution versus finer quantization or continuous regression. This discretization introduces a rounding error of up to ±5 px per coordinate that is independent of model capacity and may compound in dense or sub-20 px text layouts, directly bearing on whether the reported line-level detection metrics demonstrate sufficiency.

Authors: We thank the referee for this observation on the quantization design. The 10 px grid was selected after preliminary experiments to balance localization granularity against sequence length and training stability; finer grids rapidly increase token count and memory usage. Results on MAURDOR, which contains dense and small-text layouts, remain competitive with dedicated detectors, suggesting the bound is acceptable for the targeted use cases. We will add an explicit discussion of the ±5 px worst-case rounding error and its implications in the revised manuscript. A comprehensive ablation against continuous regression or sub-10 px grids would require substantial new compute and is noted as future work rather than part of this revision.
Revision: partial
Referee: [Experiments] Experiments section (benchmark results): Competitive performance is reported on IAM, RIMES 2009, SROIE 2019, and MAURDOR without error bars, standard deviations across multiple runs, or explicit discussion of how post-hoc curriculum choices were validated. This absence makes it difficult to assess whether the gains over baselines are robust or sensitive to random seeds and training details.

Authors: We agree that variability measures improve interpretability. All reported numbers come from single training runs owing to the cost of training 155M-parameter models on the available hardware. We will expand the Experiments section with a description of how curriculum-stage hyperparameters were selected via validation-set monitoring and will add a brief statement on the single-run limitation. Where compute permits, we will include a small number of additional seeds in the supplementary material for the primary benchmarks.
Revision: partial
Referee: [Method] Method (three-stage curriculum description): The curriculum is stated to stabilize training and improve spatial grounding, yet no ablation isolates the contribution of each stage or the prompt conditioning to the final recognition and detection metrics. Without these controls, it remains unclear whether the reported results depend on this specific procedure or would hold under simpler training regimes.

Authors: We accept that isolating the curriculum stages would strengthen the claims. In the revised manuscript we will insert a new ablation table that reports recognition and detection metrics when training is performed with (i) only stage 1, (ii) stages 1+2, and (iii) the full three-stage schedule, as well as an additional row ablating prompt conditioning. This will clarify the incremental benefit of each component.
Revision: yes
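
The variability reporting requested in the second exchange amounts to a mean and sample standard deviation over seeds. A minimal sketch follows; the numbers are placeholders chosen for illustration and are not results from the paper, and the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    return mean(scores), stdev(scores)


# Placeholder values only; these are NOT results from the paper.
cer_by_seed = {0: 5.0, 1: 5.4, 2: 4.6}
avg, spread = summarize(list(cer_by_seed.values()))
print(f"CER: {avg:.2f} ± {spread:.2f} (n={len(cer_by_seed)})")
```

Even three seeds are enough to show whether a reported gap between two systems is larger than the run-to-run spread, which is the question the referee raises.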

Circularity Check

0 steps flagged

No significant circularity in empirical architecture and evaluation

The paper describes an empirical model (PILOT) whose performance claims rest on comparisons to external baselines on public datasets (IAM, RIMES, SROIE, MAURDOR). No equations or derivations are presented that reduce a claimed prediction to a quantity defined by the model's own fitted parameters. The architecture description (CNN encoder + autoregressive decoder emitting subword and 10 px quantized coordinate tokens) is a design choice, not a self-referential derivation. Curriculum stages and prompt conditioning are training procedures whose efficacy is measured externally. No self-citation chains are invoked to justify uniqueness or forbid alternatives. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard transformer decoder assumptions and the modeling choice of a fixed 10 px quantization grid; no new physical entities or ad-hoc constants beyond typical ML hyperparameters are introduced.

free parameters (1)

10 px coordinate grid
Quantization resolution chosen to balance token count and spatial precision; directly affects coordinate token vocabulary and grounding accuracy.

axioms (1)

Domain assumption: a Transformer decoder can jointly model text and spatial tokens autoregressively without a separate localization head
Invoked in the unified sequence generation paragraph of the abstract.

A Dataset Construction Details

A.1 PDFA and IDL Datasets

We collected a set of real PDF documents and scanned images (filtered from the SafeDocs corpus and the Industry Documents Library). The main protocol for filtering the dataset is as follows:

All PDFs are converted to images at 200 DPI. Documents with dimensions larger than 2480 × 3508 are discarded or resized, as these dimensions cover the majority of standard documents. Non-straight images are rectified.
To ensure dataset heterogeneity, we limit the number of documents with similar structural layouts.
For the PDFA dataset, PaddleOCR is used to extract text lines from all images. For the IDL dataset, we employ multiple OCR systems capable of reading handwritten text, as these documents may contain a significant proportion of handwritten samples.
Documents are further filtered based on their content (e.g., removal of non-Latin characters, empty content, or illegible text).
To reduce computational time during pre-training, we resize all images so that the median height is 2200 pixels and the median width is 1700 pixels. See Figure 3 for sample images.

A.2 IAM and RIMES 2009

To obtain text line position annotations, we initially used segmentation models to generate pre-annotations. However, after matching the pre-annotations with text l...

[1] [1]

EAST: An Efficient and Accurate Scene Text Detector

Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: EAST: An Efficient and Accurate Scene Text Detector. Preprint at https://doi.org/10.48550/ arXiv.1704.03155 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] Jaided AI: EasyOCR: Ready-to-use Optical Character Recognition with Deep Learning. https://github.com/JaidedAI/EasyOCR (2020)

[3] Smith, R.: An Overview of the Tesseract OCR Engine. In: Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 629–633. IEEE (2007)

[4] Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character Region Awareness for Text Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9365–9374 (2019)

[5] PaddlePaddle Community: PaddleOCR: An Open-Source Optical Character Recognition Tool Based on PaddlePaddle. https://github.com/PaddlePaddle/PaddleOCR (2021)

[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I.: Attention is All You Need. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 30. Curran Associates, Inc. (2017)

[7] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou: LayoutLM: Pre-training of text and layout for document image understanding. In: KDD 2020, pages 1192–1200, New York, NY, USA. Association for Computing Machinery

[8] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint C...

[9] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei: LayoutLMv3: Pre-training for Document AI with unified text and image masking. In: Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, New York, NY, USA. Association for Computing Machinery, 2022

[10] Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents. CoRR, abs/2108.04539. https://arxiv.org/abs/2108.04539 (2021)

[11] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-Free Document Understanding Transformer. In: Computer Vision – ECCV 2022, pp. 498–517. Springer Nature Switzerland (2022)

[12] Lee, K., Joshi, M., Turc, I., Hu, H., Liu, F., Eisenschlos, J., Khandelwal, U., Shaw, P., Chang, M.-W., Toutanova, K.: Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. In: Proceedings of the 40th International Conference on Machine Learning (ICML), JMLR.org (2023)

[13] Coquenet, D., Chatelain, C., Paquet, T.: DAN: A Segmentation-Free Document Attention Network for Handwritten Document Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–17 (2023)

[14] Constum, T., Tranouez, P., Paquet, T.: DANIEL: A Fast Document Attention Network for Information Extraction and Labelling of Handwritten Documents. IJDAR (2025)

[15] Ares Oliveira, Sofia and Seguin, Benoit and Kaplan, Frederic: dhSegment: A generic deep-learning approach for document segmentation. In Frontiers in Handwriting Recognition (ICFHR), 2018 16th International Conference

[16] Joan Puigcerver and Carlos Mocholí: PyLaia. https://github.com/jpuigcerver/PyLaia (2018)

[17] Emmanuèle Grosicki, Matthieu Carré, Jean-Marie Brodin, and Edouard Geoffrois. Results of the RIMES Evaluation Campaign for Handwritten Mail Processing. In 2009 10th International Conference on Document Analysis and Recognition, pages 941–945, July 2009

[18] S. Brunessaux, P. Giroux, B. Grilhères, M. Manta, M. Bodin, K. Choukri, O. Galibert, and J. Kahn: The MAURDOR project: Improving automatic processing of digital documents. In: International Workshop on Document Analysis Systems, 2014, pp. 349–354

[19] Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florêncio, D., Zhang, C., Li, Z., Wei, F.: TrOCR: Transformer-Based Optical Character Recognition with Pre-Trained Models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 13094–13102 (2023)

[20] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer: Multilingual Denoising Pre-training for Neural Machine Translation. In: Transactions of the Association for Computational Linguistics, 2020

[21] Brian Davis, Bryan Morse, Brian Price, Chris Tensmeyer, Curtis Wigington, and Vlad Morariu: End-to-end document recognition and understanding with Dessurt. In Computer Vision – ECCV 2022 Workshops, pages 280–296, Cham, 2023. Springer Nature Switzerland

[22] Zheng, M., Feng, X., Si, Q., She, Q., Lin, Z., Jiang, W., Wang, W.: Multimodal Table Understanding. In: Proceedings of ACL (2024)

[23] Lucas Beyer and Andreas Steiner and André Susano Pinto and Alexander Kolesnikov and Xiao Wang and Daniel Salz and Maxim Neumann and Ibrahim Alabdulmohsin and Michael Tschannen and Emanuele Bugliarello and Thomas Unterthiner and Daniel Keysers and Skanda Koppula and Fangyu Liu and Adam Grycner and Alexey Gritsenko and Neil Houlsby and Manoj Kumar and K...

[24] Mao, Z., Bai, H., Hou, L., Shang, L., Jiang, X., Liu, Q., Wong, K.-F.: Visually Guided Generative Text-Layout Pre-training for Document Intelligence. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), vol. 1, pp. 4713–4730 (2024)

[25] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, November 2002

[26] O. Kodym and M. Hradiš. Page Layout Analysis System for Unconstrained Historic Documents. International Conference on Document Analysis and Recognition (ICDAR), 2021

[27] M. Kišš, K. Beneš, and M. Hradiš. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. International Conference on Document Analysis and Recognition (ICDAR), 2021

[28] J. Kohút and M. Hradiš. TS-Net: OCR Trained to Switch Between Text Transcription Styles. International Conference on Document Analysis and Recognition (ICDAR), 2021

[29] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008

[30] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar: ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520

[31] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee: Character region awareness for text detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pages 9357–9366

[32] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement, 2018

[33] M. Yousef and T. E. Bishop. OrigamiNet: Weakly-supervised, segmentation-free, one-step, full page text recognition by learning to unfold. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pages 14698–14707, Los Alamitos, CA, USA, June 2020. IEEE Computer Society

[34] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, 2016

[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015

[36] C. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pages 2231–2239

[37] Wolf, C., Jolion, J.-M.: Object Count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms. In International Journal of Document Analysis and Recognition, vol. 8, pp. 280–296 (2006). https://doi.org/10.1007/s10032-006-0014-0

[38] Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., Ge, Z., Zhao, L., Sun, J., Peng, Y., Han, C., Zhang, X.: General OCR Theory: Towards OCR-2.0 via a Unified End-to-End Model. In: Proceedings of CVPR (2024). https://arxiv.org/abs/2409.01704

[39] Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

[40] Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae: Visual Instruction Tuning. In NeurIPS 2023

A Dataset Construction Details

A.1 PDFA and IDL Datasets

We collected a set of real PDF documents and scanned images (filtered from the SafeDocs corpus and the Industry Documents Library). The main protocol for filtering the dataset is as follows:

1. All PDFs are converted to images at 200 DPI. Documents with dimensions larger than 2480 × 3508 are discarded or resized, as these dimensions cover the majority of standard documents. Non-straight images are rectified.

2. To ensure dataset heterogeneity, we limit the number of documents with similar structural layouts.

3. For the PDFA dataset, PaddleOCR is used to extract text lines from all images. For the IDL dataset, we employ multiple OCR systems capable of reading handwritten text, as these documents may contain a significant proportion of handwritten samples.

4. Documents are further filtered based on their content (e.g., removal of non-Latin characters, empty content, or illegible text).

5. To reduce computational time during pre-training, we resize all images so that the median height is 2200 pixels and the median width is 1700 pixels. See Figure 3 for sample images.

A.2 IAM and RIMES 2009

To obtain text line position annotations, we initially used segmentation models to generate pre-annotations. However, after matching the pre-annotations with text l...