PILOT: A Promptable Interleaved Layout-aware OCR Transformer
Pith reviewed 2026-05-22 20:53 UTC · model grok-4.3
The pith
A single compact model jointly performs text recognition and spatial grounding on handwritten and printed documents via unified sequence generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PILOT formulates OCR as unified sequence generation in which a lightweight depthwise-separable CNN encodes the input page and a Transformer decoder autoregressively emits subword tokens interleaved with quantized absolute-coordinate tokens on a fixed 10 px grid; this single stream supports full-page OCR, region-conditioned reading, and query-by-string spotting when the model is conditioned on appropriate prompts.
What carries the argument
Unified autoregressive generation of interleaved subword and quantized absolute-coordinate tokens on a 10 px grid, conditioned on prompts.
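To make the interleaved stream concrete, here is a minimal Python sketch of how line boxes could be snapped to a 10 px grid and merged with text tokens. The token spelling (`<x_12>`, `<y_40>`), the whitespace split standing in for a subword tokenizer, nearest-point rounding, and the text-then-box ordering are all assumptions made for illustration; the summary states only that subword and quantized absolute-coordinate tokens share one stream.

```python
from typing import List, Tuple

GRID_PX = 10  # grid step stated in the paper

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def quantize(value_px: float, step: int = GRID_PX) -> int:
    """Snap a pixel coordinate to the nearest grid index (halves round up)."""
    return int((value_px + step / 2) // step)


def dequantize(index: int, step: int = GRID_PX) -> int:
    """Map a grid index back to its absolute pixel position."""
    return index * step


def box_tokens(box: Box) -> List[str]:
    """Hypothetical coordinate tokens: one token per quantized corner value."""
    x0, y0, x1, y1 = (quantize(v) for v in box)
    return [f"<x_{x0}>", f"<y_{y0}>", f"<x_{x1}>", f"<y_{y1}>"]


def interleave(lines: List[Tuple[str, Box]]) -> List[str]:
    """Emit each line's text tokens followed by its box tokens."""
    stream: List[str] = []
    for text, box in lines:
        stream.extend(text.split())  # stand-in for a real subword tokenizer
        stream.extend(box_tokens(box))
    return stream
```

Because the coordinates are absolute grid indices, decoding a box back to pixels is a single multiplication per value, with no dependence on previously emitted boxes.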
If this is right
- Full-page OCR, region-conditioned reading, and query-by-string spotting become possible inside one 155M-parameter decoder without task-specific heads.
- A three-stage curriculum progressing from transcription to joint text-and-box generation to prompt control stabilizes training of the interleaved output.
- Competitive recognition and detection accuracy is achieved on IAM, RIMES 2009, SROIE 2019, and heterogeneous MAURDOR while using far fewer parameters than billion-scale multimodal models.
- Releasing the synthetic SROIE generator, 500k IDL/PDFA pages, and harmonized line-level annotations enables direct reproduction and further prompt-based experiments.
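The three-stage curriculum in the second bullet can be pictured as three ways of building (prompt, target) pairs from the same annotated page. The sketch below is an illustration under stated assumptions: the prompt tokens (`<ocr>`, `<ocr_with_boxes>`, `<read_region>`, `<find>`), the coordinate token format, and the rule that a line belongs to a region when its box lies fully inside it are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) as grid indices
Line = Tuple[str, Box]
Example = Tuple[List[str], List[str]]  # (prompt tokens, target tokens)


def coords(box: Box) -> List[str]:
    """Hypothetical coordinate tokens for one box."""
    return [f"<x_{box[0]}>", f"<y_{box[1]}>", f"<x_{box[2]}>", f"<y_{box[3]}>"]


def stage1_example(lines: List[Line]) -> Example:
    """Stage 1: plain transcription, no coordinates in the target."""
    return ["<ocr>"], [tok for text, _ in lines for tok in text.split()]


def stage2_example(lines: List[Line]) -> Example:
    """Stage 2: joint text-and-box generation."""
    target: List[str] = []
    for text, box in lines:
        target += text.split() + coords(box)
    return ["<ocr_with_boxes>"], target


def inside(inner: Box, outer: Box) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def stage3_region(lines: List[Line], region: Box) -> Example:
    """Stage 3, region-conditioned reading: the prompt carries the region."""
    target = [tok for text, box in lines if inside(box, region)
              for tok in text.split()]
    return ["<read_region>"] + coords(region), target


def stage3_query(lines: List[Line], query: str) -> Example:
    """Stage 3, query-by-string spotting: the target is the matching boxes."""
    target: List[str] = []
    for text, box in lines:
        if query in text:
            target += coords(box)
    return ["<find>"] + query.split(), target
```

In this framing the decoder and vocabulary stay fixed across stages; only the prompt and the composition of the target sequence change.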
Where Pith is reading between the lines
- The fixed 10 px grid may limit precision on very small text or curved layouts, suggesting a possible extension to adaptive or finer quantization.
- Because the same decoder handles both printed and handwritten input, the approach could be tested on mixed-language or degraded historical documents without retraining separate recognizers.
- Prompt conditioning opens the door to interactive document agents that request only specific fields or answer layout-aware questions directly from the generated token stream.
Load-bearing premise
Autoregressive emission of interleaved text and fixed-grid coordinate tokens is sufficient to produce accurate spatial layout without separate detection stages or post-processing.
What would settle it
Measure whether line-level detection F1 on SROIE or MAURDOR drops below that of a standard two-stage OCR pipeline when the model is required to output both text and boxes from the same generated sequence.
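One way to run this test is to score both systems with the same line-level F1. The sketch below assumes axis-aligned boxes, a greedy one-to-one match, and an IoU threshold of 0.5; the paper's exact matching protocol is not given in this summary, so these choices are placeholders.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred: List[Box], gold: List[Box], thr: float = 0.5) -> float:
    """Greedy matching: each gold line can be claimed by at most one prediction."""
    unmatched = list(gold)
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Running the same function on boxes parsed from PILOT's generated sequence and on boxes from a two-stage pipeline keeps the comparison independent of how each system produces its boxes.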
Original abstract
Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified autoregressive sequence generation. A lightweight depthwise-separable CNN encodes the input page while a Transformer decoder emits interleaved subword tokens and quantized absolute-coordinate tokens on a fixed 10 px grid. Training proceeds via a three-stage curriculum (plain transcription, joint text-and-box generation, prompt-controlled extraction). Experiments report competitive or superior text recognition and line-level detection results on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark relative to traditional OCR pipelines, end-to-end HTR models, and compact vision-language models, with additional evaluations on fine-grained OCR and query-by-string spotting. Code, synthetic generators, and harmonized annotations are released.
Significance. If the empirical claims hold after addressing the noted gaps, the work would demonstrate that a single compact unified decoder can jointly handle recognition and spatial grounding for both handwritten and printed documents without separate detection or post-processing stages. This has potential to simplify interactive OCR pipelines. The public release of the synthetic SROIE generator, 500k annotated pages, harmonized line-level annotations, and source code strengthens reproducibility and downstream utility.
major comments (3)
- [Abstract] Abstract (unified sequence generation paragraph): The central claim that autoregressive interleaving of subword and quantized absolute-coordinate tokens on a fixed 10 px grid suffices for accurate joint recognition and layout without separate detection stages lacks any ablation of grid resolution versus finer quantization or continuous regression. This discretization imposes a fixed ±5 px lower bound on localization error that is independent of model capacity and may compound in dense or sub-20 px text layouts, directly bearing on whether the reported line-level detection metrics demonstrate sufficiency.
- [Experiments] Experiments section (benchmark results): Competitive performance is reported on IAM, RIMES 2009, SROIE 2019, and MAURDOR without error bars, standard deviations across multiple runs, or explicit discussion of how post-hoc curriculum choices were validated. This absence makes it difficult to assess whether the gains over baselines are robust or sensitive to random seeds and training details.
- [Method] Method (three-stage curriculum description): The curriculum is stated to stabilize training and improve spatial grounding, yet no ablation isolates the contribution of each stage or the prompt conditioning to the final recognition and detection metrics. Without these controls, it remains unclear whether the reported results depend on this specific procedure or would hold under simpler training regimes.
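The ±5 px figure in the first major comment follows from rounding to the nearest grid point. The short derivation below assumes nearest-point rounding (the summary does not say whether coordinates are rounded, floored, or ceiled) and uses a hypothetical 20 px tall line to show the effect on vertical overlap.

```latex
% Rounding a coordinate x to the nearest multiple of the step s = 10 px
q(x) = s \left\lfloor \frac{x}{s} + \frac{1}{2} \right\rfloor,
\qquad
|x - q(x)| \le \frac{s}{2} = 5\ \text{px}.

% Hypothetical line spanning y \in [6, 26] (height 20 px):
% q(6) = 10 and q(26) = 30, so the quantized span is [10, 30].
\mathrm{IoU}_y
  = \frac{\bigl|[6,26] \cap [10,30]\bigr|}{\bigl|[6,26] \cup [10,30]\bigr|}
  = \frac{16}{24} \approx 0.67.
```

Under floor or ceil quantization the per-coordinate error would instead be bounded by just under one full step. In either case the bound is set by the grid, not by model capacity, which is the point the referee raises.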
minor comments (2)
- [Abstract] The model-size comparison to billion-scale VLMs would be clearer if presented in a dedicated table rather than only in the abstract text.
- [Method] Notation for the quantized coordinate tokens could include an explicit example sequence in the method figure to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting the original experiments.
- Referee: [Abstract] Abstract (unified sequence generation paragraph): The central claim that autoregressive interleaving of subword and quantized absolute-coordinate tokens on a fixed 10 px grid suffices for accurate joint recognition and layout without separate detection stages lacks any ablation of grid resolution versus finer quantization or continuous regression. This discretization imposes a fixed ±5 px lower bound on localization error that is independent of model capacity and may compound in dense or sub-20 px text layouts, directly bearing on whether the reported line-level detection metrics demonstrate sufficiency.
  Authors: We thank the referee for this observation on the quantization design. The 10 px grid was selected after preliminary experiments to balance localization granularity against sequence length and training stability; finer grids rapidly increase token count and memory usage. Results on MAURDOR, which contains dense and small-text layouts, remain competitive with dedicated detectors, suggesting the bound is acceptable for the targeted use cases. We will add an explicit discussion of the ±5 px error floor and its implications in the revised manuscript. A comprehensive ablation against continuous regression or sub-10 px grids would require substantial new compute and is noted as future work rather than part of this revision.
  Revision: partial
- Referee: [Experiments] Experiments section (benchmark results): Competitive performance is reported on IAM, RIMES 2009, SROIE 2019, and MAURDOR without error bars, standard deviations across multiple runs, or explicit discussion of how post-hoc curriculum choices were validated. This absence makes it difficult to assess whether the gains over baselines are robust or sensitive to random seeds and training details.
  Authors: We agree that variability measures improve interpretability. All reported numbers come from single training runs owing to the cost of training 155M-parameter models on the available hardware. We will expand the Experiments section with a description of how curriculum-stage hyperparameters were selected via validation-set monitoring and will add a brief statement on the single-run limitation. Where compute permits, we will include a small number of additional seeds in the supplementary material for the primary benchmarks.
  Revision: partial
- Referee: [Method] Method (three-stage curriculum description): The curriculum is stated to stabilize training and improve spatial grounding, yet no ablation isolates the contribution of each stage or the prompt conditioning to the final recognition and detection metrics. Without these controls, it remains unclear whether the reported results depend on this specific procedure or would hold under simpler training regimes.
  Authors: We accept that isolating the curriculum stages would strengthen the claims. In the revised manuscript we will insert a new ablation table that reports recognition and detection metrics when training is performed with (i) only stage 1, (ii) stages 1+2, and (iii) the full three-stage schedule, as well as an additional row ablating prompt conditioning. This will clarify the incremental benefit of each component.
  Revision: yes
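For the single-run limitation raised in the second response, reporting a mean and sample standard deviation over seeds takes only a few lines. The function below is a generic sketch using Python's standard library; the choice of metric and the number of seeds are left open, and no values from the paper are implied.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format per-seed scores as 'mean ± std (n=k)'.

    The sample standard deviation needs at least two runs, so a single
    run is reported without a spread.
    """
    if not scores:
        raise ValueError("need at least one run")
    if len(scores) == 1:
        return f"{scores[0]:.2f} (n=1, no spread available)"
    return f"{mean(scores):.2f} ± {stdev(scores):.2f} (n={len(scores)})"
```

Reporting the run count next to the spread makes it clear when a table entry rests on one seed only.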
Circularity Check
No significant circularity in empirical architecture and evaluation
The paper describes an empirical model (PILOT) whose performance claims rest on comparisons to external baselines on public datasets (IAM, RIMES, SROIE, MAURDOR). No equations or derivations are presented that reduce a claimed prediction to a quantity defined by the model's own fitted parameters. The architecture description (CNN encoder + autoregressive decoder emitting subword and 10 px quantized coordinate tokens) is a design choice, not a self-referential derivation. Curriculum stages and prompt conditioning are training procedures whose efficacy is measured externally. No self-citation chains are invoked to justify uniqueness or forbid alternatives. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- 10 px coordinate grid
axioms (1)
- domain assumption: a Transformer decoder can jointly model text and spatial tokens autoregressively without a separate localization head
A Dataset Construction Details
A.1 PDFA and IDL Datasets
We collected a set of real PDF documents and scanned images (filtered from the SafeDocs corpus and the Industry Documents Library). The main protocol for filtering the dataset is as follows: ...
- All PDFs are converted to images at 200 DPI. Documents with dimensions larger than 2480 × 3508 are discarded or resized, as these dimensions cover the majority of standard documents. Non-straight images are rectified.
- To ensure dataset heterogeneity, we limit the number of documents with similar structural layouts.
- For the PDFA dataset, PaddleOCR is used to extract text lines from all images. For the IDL dataset, we employ multiple OCR systems capable of reading handwritten text, as these documents may contain a significant proportion of handwritten samples.
- Documents are further filtered based on their content (e.g., removal of non-Latin characters, empty content, or illegible text).
- To reduce computational time during pre-training, we resize all images so that the median height is 2200 pixels and the median width is 1700 pixels. See Figure 3 for sample images.
A.2 IAM and RIMES 2009
To obtain text line position annotations, we initially used segmentation models to generate pre-annotations. However, after matching the pre-annotations with text l...