arxiv: 2504.03621 · v2 · submitted 2025-04-04 · 💻 cs.CV

PILOT: A Promptable Interleaved Layout-aware OCR Transformer

Pith reviewed 2026-05-22 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords OCR · document layout analysis · transformer decoder · promptable generation · handwritten text recognition · spatial grounding · end-to-end sequence model

The pith

A single compact model jointly performs text recognition and spatial grounding on handwritten and printed documents via unified sequence generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether document OCR can be reformulated as autoregressive generation of a single interleaved stream of text subwords and quantized coordinate tokens, eliminating separate detection and segmentation stages. A 155M-parameter model, in which a prompt-conditioned Transformer decoder reads a CNN-encoded page image, supports full-page transcription, region-conditioned extraction, and query-by-string spotting within one architecture. Training proceeds through a three-stage curriculum that first teaches plain transcription, then joint text-and-box output, and finally prompt control. Experiments on IAM, RIMES 2009, SROIE 2019, and MAURDOR show recognition and line-level detection accuracy competitive with traditional pipelines and larger models, at a substantially smaller parameter count.

Core claim

PILOT formulates OCR as unified sequence generation in which a lightweight depthwise-separable CNN encodes the input page and a Transformer decoder autoregressively emits subword tokens interleaved with quantized absolute-coordinate tokens on a fixed 10 px grid; this single stream supports full-page OCR, region-conditioned reading, and query-by-string spotting when the model is conditioned on appropriate prompts.
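The interleaved stream described above can be pictured with a small round-trip sketch. Everything here beyond the 10 px grid step is an assumption made for illustration: the `<loc_N>` token spelling, the nearest-grid rounding rule, the four-coordinates-per-line box layout, and the whitespace split that stands in for a real subword tokenizer are not taken from the paper.

```python
import re

GRID_PX = 10  # grid step stated in the paper

def quantize(v_px):
    """Map a pixel coordinate to the nearest grid index (rounding rule is an assumption)."""
    return int(v_px / GRID_PX + 0.5)

def dequantize(idx):
    """Map a grid index back to a pixel coordinate on the grid."""
    return idx * GRID_PX

def encode_page(lines):
    """Build one interleaved stream: the words of each line, then its four box tokens.

    `lines` is a list of (text, (x1, y1, x2, y2)) pairs in pixels.
    The whitespace split is a stand-in for a real subword tokenizer.
    """
    stream = []
    for text, box_px in lines:
        stream.extend(text.split())
        stream.extend(f"<loc_{quantize(v)}>" for v in box_px)  # hypothetical token spelling
    return stream

LOC = re.compile(r"<loc_(\d+)>")

def decode_page(stream):
    """Recover (text, box) pairs; a line closes once four coordinate tokens are read."""
    lines, words, coords = [], [], []
    for tok in stream:
        m = LOC.fullmatch(tok)
        if m is None:
            words.append(tok)
            continue
        coords.append(dequantize(int(m.group(1))))
        if len(coords) == 4:
            lines.append((" ".join(words), tuple(coords)))
            words, coords = [], []
    return lines
```

Encoding a line such as `("Total 12.50", (312, 1388, 707, 1452))` and decoding it again returns the same text with the box snapped to `(310, 1390, 710, 1450)`, which shows both the single-stream idea and the precision lost to the grid.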

What carries the argument

Unified autoregressive generation of interleaved subword and quantized absolute-coordinate tokens on a 10 px grid, conditioned on prompts.

If this is right

  • Full-page OCR, region-conditioned reading, and query-by-string spotting become possible inside one 155M-parameter model without task-specific heads.
  • A three-stage curriculum progressing from transcription to joint text-and-box generation to prompt control stabilizes training of the interleaved output.
  • Competitive recognition and detection accuracy is achieved on IAM, RIMES 2009, SROIE 2019, and heterogeneous MAURDOR while using far fewer parameters than billion-scale multimodal models.
  • Releasing the synthetic SROIE generator, 500k IDL/PDFA pages, and harmonized line-level annotations enables direct reproduction and further prompt-based experiments.
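As a rough picture of how the three curriculum stages differ, the sketch below builds the training target for each stage from the same annotated lines. It is a hypothetical illustration, not the authors' data pipeline: the `<loc_N>` token spelling, the rounding rule, and the exact target layout are assumptions. Only the 10 px grid and the "Read at x1, y1, x2, y2" prompt wording (visible in the paper's figure examples) come from the source, and treating the prompt numbers as grid indices is itself an assumption.

```python
GRID_PX = 10  # grid step stated in the paper

def q(v_px):
    """Nearest grid index for a pixel coordinate (rounding rule is an assumption)."""
    return int(v_px / GRID_PX + 0.5)

def stage1_target(lines):
    """Stage 1 (plain transcription): text only, one line per row."""
    return "\n".join(text for text, _ in lines)

def stage2_target(lines):
    """Stage 2 (joint text and box): each line followed by its four coordinate tokens."""
    return "\n".join(
        text + " " + " ".join(f"<loc_{q(v)}>" for v in box)  # hypothetical token spelling
        for text, box in lines
    )

def stage3_region_example(lines, i):
    """Stage 3 (prompt control): a region prompt paired with the text expected there."""
    text, box = lines[i]
    prompt = "Read at " + ", ".join(str(q(v)) for v in box)
    return prompt, text
```

With two toy lines, `stage3_region_example(lines, 1)` returns a pair such as `("Read at 10, 25, 42, 28", "Thank you")`, so the third stage reuses the same coordinate vocabulary as the second but moves it to the input side.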

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed 10 px grid may limit precision on very small text or curved layouts, suggesting a possible extension to adaptive or finer quantization.
  • Because the same decoder handles both printed and handwritten input, the approach could be tested on mixed-language or degraded historical documents without retraining separate recognizers.
  • Prompt conditioning opens the door to interactive document agents that request only specific fields or answer layout-aware questions directly from the generated token stream.
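The precision concern in the first bullet can be made concrete with a short worst-case calculation. This is a sketch under one assumption the paper does not confirm: that coordinates are snapped to the nearest grid point (with truncation instead, the per-coordinate error would be bounded by $g$ rather than $g/2$). The symbols $g$, $q$, $h$, and $\hat h$ are introduced here for illustration only.

```latex
% g: grid step (10 px in the paper); q: nearest-grid quantizer (assumed rounding rule)
\[
  q(x) = g \left\lfloor \frac{x}{g} + \frac{1}{2} \right\rfloor ,
  \qquad
  |x - q(x)| \le \frac{g}{2} = 5\ \text{px}.
\]
% A line box with top edge y_1 and bottom edge y_2 has height h = y_2 - y_1.
\[
  \hat h = q(y_2) - q(y_1),
  \qquad
  |\hat h - h| \le |q(y_2) - y_2| + |q(y_1) - y_1| \le g = 10\ \text{px}.
\]
% Relative height error for a line of height h:
\[
  \frac{|\hat h - h|}{h} \le \frac{g}{h},
  \qquad \text{e.g. } h = 20\ \text{px} \;\Rightarrow\; \frac{g}{h} = 50\%.
\]
```

For tall lines the bound is negligible, while for lines around 20 px high it allows up to half the height to be lost or gained, which is why finer or adaptive quantization is a natural extension to test.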

Load-bearing premise

Autoregressive emission of interleaved text and fixed-grid coordinate tokens is sufficient to produce accurate spatial layout without separate detection stages or post-processing.

What would settle it

Measure whether line-level detection F1 on SROIE or MAURDOR drops below that of a standard two-stage OCR pipeline when the model is required to output both text and boxes from the same generated sequence.
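A minimal sketch of such a measurement follows, assuming axis-aligned boxes, an IoU threshold of 0.5, and greedy one-to-one matching. These choices are illustrative assumptions; the paper's actual detection protocol may differ, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def detection_f1(pred, gt, thr=0.5):
    """Line-level F1 with greedy one-to-one matching at IoU >= thr (assumed protocol)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in pred:
        best_j, best_score = None, thr
        for j, g in enumerate(gt):
            if j in matched:
                continue
            score = iou(p, g)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Running the same function on boxes decoded from the generated sequence and on boxes from a two-stage pipeline, against the same ground truth, gives the comparison the test calls for. Greedy matching is a simplification; an optimal assignment could change borderline cases.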

Figures

Figures reproduced from arXiv: 2504.03621 by Amine Tamasna, Laziz Hamdi, Pascal Boisson, Thierry Paquet.

Figure 1: Synthetic image with the corresponding OCR and locations transcription.
Figure 2: Overall architecture consists of a CNN vision encoder and a Transformer …
Figure 3: Samples from real datasets enriched with text line level annotations
Figure 4: Synthetic samples.
Figure 5: t-SNE 2-dimensional representations of location token embeddings.
  Each point corresponds to a token that represents a quantized coordinate in the document. Notably, tokens that encode nearby positions in the document space cluster together in a curve following each other with a similar gap, reflecting a learned geometric ordering. This indicates that the model's embedding layer captures spatial continuity: tokens for higher or lower coordinates on an ax…
Figure 6: Examples of Region-Based OCR Predictions (errors are highlighted in …
  Prompt: Read at 66, 184, 97, 186. Label: 'e-mail:burgess@world.std.com.' Prediction: 'usa'
  Prompt: Read at 31, 139, 71, 145. Label: 'exams moved to a' Prediction: 'evans moved to a'
Figure 7: Examples of Content-Based text localization, in the left the label loca…
Original abstract

Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified autoregressive sequence generation. A lightweight depthwise-separable CNN encodes the input page while a Transformer decoder emits interleaved subword tokens and quantized absolute-coordinate tokens on a fixed 10 px grid. Training proceeds via a three-stage curriculum (plain transcription, joint text-and-box generation, prompt-controlled extraction). Experiments report competitive or superior text recognition and line-level detection results on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark relative to traditional OCR pipelines, end-to-end HTR models, and compact vision-language models, with additional evaluations on fine-grained OCR and query-by-string spotting. Code, synthetic generators, and harmonized annotations are released.

Significance. If the empirical claims hold after addressing the noted gaps, the work would demonstrate that a single compact unified decoder can jointly handle recognition and spatial grounding for both handwritten and printed documents without separate detection or post-processing stages. This has potential to simplify interactive OCR pipelines. The public release of the synthetic SROIE generator, 500k annotated pages, harmonized line-level annotations, and source code strengthens reproducibility and downstream utility.

major comments (3)
  1. [Abstract] Abstract (unified sequence generation paragraph): The central claim that autoregressive interleaving of subword and quantized absolute-coordinate tokens on a fixed 10 px grid suffices for accurate joint recognition and layout without separate detection stages lacks any ablation of grid resolution versus finer quantization or continuous regression. This discretization imposes a fixed ±5 px lower bound on localization error that is independent of model capacity and may compound in dense or sub-20 px text layouts, directly bearing on whether the reported line-level detection metrics demonstrate sufficiency.
  2. [Experiments] Experiments section (benchmark results): Competitive performance is reported on IAM, RIMES 2009, SROIE 2019, and MAURDOR without error bars, standard deviations across multiple runs, or explicit discussion of how post-hoc curriculum choices were validated. This absence makes it difficult to assess whether the gains over baselines are robust or sensitive to random seeds and training details.
  3. [Method] Method (three-stage curriculum description): The curriculum is stated to stabilize training and improve spatial grounding, yet no ablation isolates the contribution of each stage or the prompt conditioning to the final recognition and detection metrics. Without these controls, it remains unclear whether the reported results depend on this specific procedure or would hold under simpler training regimes.
minor comments (2)
  1. [Abstract] The model-size comparison to billion-scale VLMs would be clearer if presented in a dedicated table rather than only in the abstract text.
  2. [Method] Notation for the quantized coordinate tokens could include an explicit example sequence in the method figure to aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting the original experiments.

Point-by-point responses
  1. Referee: [Abstract] Abstract (unified sequence generation paragraph): The central claim that autoregressive interleaving of subword and quantized absolute-coordinate tokens on a fixed 10 px grid suffices for accurate joint recognition and layout without separate detection stages lacks any ablation of grid resolution versus finer quantization or continuous regression. This discretization imposes a fixed ±5 px lower bound on localization error that is independent of model capacity and may compound in dense or sub-20 px text layouts, directly bearing on whether the reported line-level detection metrics demonstrate sufficiency.

    Authors: We thank the referee for this observation on the quantization design. The 10 px grid was selected after preliminary experiments to balance localization granularity against sequence length and training stability; finer grids rapidly increase token count and memory usage. Results on MAURDOR, which contains dense and small-text layouts, remain competitive with dedicated detectors, suggesting the bound is acceptable for the targeted use cases. We will add an explicit discussion of the ±5 px error floor and its implications in the revised manuscript. A comprehensive ablation against continuous regression or sub-10 px grids would require substantial new compute and is noted as future work rather than part of this revision. revision: partial

  2. Referee: [Experiments] Experiments section (benchmark results): Competitive performance is reported on IAM, RIMES 2009, SROIE 2019, and MAURDOR without error bars, standard deviations across multiple runs, or explicit discussion of how post-hoc curriculum choices were validated. This absence makes it difficult to assess whether the gains over baselines are robust or sensitive to random seeds and training details.

    Authors: We agree that variability measures improve interpretability. All reported numbers come from single training runs owing to the cost of training 155M-parameter models on the available hardware. We will expand the Experiments section with a description of how curriculum-stage hyperparameters were selected via validation-set monitoring and will add a brief statement on the single-run limitation. Where compute permits, we will include a small number of additional seeds in the supplementary material for the primary benchmarks. revision: partial

  3. Referee: [Method] Method (three-stage curriculum description): The curriculum is stated to stabilize training and improve spatial grounding, yet no ablation isolates the contribution of each stage or the prompt conditioning to the final recognition and detection metrics. Without these controls, it remains unclear whether the reported results depend on this specific procedure or would hold under simpler training regimes.

    Authors: We accept that isolating the curriculum stages would strengthen the claims. In the revised manuscript we will insert a new ablation table that reports recognition and detection metrics when training is performed with (i) only stage 1, (ii) stages 1+2, and (iii) the full three-stage schedule, as well as an additional row ablating prompt conditioning. This will clarify the incremental benefit of each component. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical architecture and evaluation

full rationale

The paper describes an empirical model (PILOT) whose performance claims rest on comparisons to external baselines on public datasets (IAM, RIMES, SROIE, MAURDOR). No equations or derivations are presented that reduce a claimed prediction to a quantity defined by the model's own fitted parameters. The architecture description (CNN encoder + autoregressive decoder emitting subword and 10 px quantized coordinate tokens) is a design choice, not a self-referential derivation. Curriculum stages and prompt conditioning are training procedures whose efficacy is measured externally. No self-citation chains are invoked to justify uniqueness or forbid alternatives. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard transformer decoder assumptions and the modeling choice of a fixed 10 px quantization grid; no new physical entities or ad-hoc constants beyond typical ML hyperparameters are introduced.

free parameters (1)
  • 10 px coordinate grid
    Quantization resolution chosen to balance token count and spatial precision; directly affects coordinate token vocabulary and grounding accuracy.
axioms (1)
  • domain assumption: a Transformer decoder can jointly model text and spatial tokens autoregressively without a separate localization head
    Invoked in the unified sequence generation paragraph of the abstract.

pith-pipeline@v0.9.0 · 5842 in / 1356 out tokens · 23000 ms · 2026-05-22T20:53:28.769405+00:00 · methodology

A Dataset Construction Details

A.1 PDFA and IDL Datasets

We collected a set of real PDF documents and scanned images (filtered from the SafeDocs corpus and the Industry Documents Library). The main protocol for filtering the dataset is as follows:

  1. All PDFs are converted to images at 200 DPI. Documents with dimensions larger than 2480 × 3508 are discarded or resized, as these dimensions cover the majority of standard documents. Non-straight images are rectified.
  2. To ensure dataset heterogeneity, we limit the number of documents with similar structural layouts.
  3. For the PDFA dataset, PaddleOCR is used to extract text lines from all images. For the IDL dataset, we employ multiple OCR systems capable of reading handwritten text, as these documents may contain a significant proportion of handwritten samples.
  4. Documents are further filtered based on their content (e.g., removal of non-Latin characters, empty content, or illegible text).
  5. To reduce computational time during pre-training, we resize all images so that the median height is 2200 pixels and the median width is 1700 pixels. See Figure 3 for sample images.

A.2 IAM and RIMES 2009

To obtain text line position annotations, we initially used segmentation models to generate pre-annotations. However, after matching the pre-annotations with text l...