RT-DocLayout: Real-Time End-to-End Document Layout Analysis with Reading Order in the Wild

Changda Zhou; Cheng Cui; Hongen Liu; Jiaxuan Liu; Manhui Lin; Suyin Liang; Tingquan Gao; Ting Sun; Xueqing Wang; Yi Liu

arxiv: 2606.23344 · v1 · pith:OX3F7UCYnew · submitted 2026-06-22 · 💻 cs.CV

RT-DocLayout: Real-Time End-to-End Document Layout Analysis with Reading Order in the Wild

Cheng Cui , Tingquan Gao , Xueqing Wang , Changda Zhou , Hongen Liu , Ting Sun , Yubo Zhang , Zelun Zhang

show 6 more authors

Jiaxuan Liu Manhui Lin Yue Zhang Suyin Liang Yiqing Xiang Yi Liu

This is my paper

Pith reviewed 2026-06-26 09:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords document layout analysisreading order predictionend-to-end frameworkmulti-task learningreal-time inferenceobject detectionpixel segmentation

0 comments

The pith

A single 33M-parameter model unifies classification, detection, segmentation and reading order prediction for documents in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a unified multi-task decoder can handle all core layout analysis steps together instead of relying on separate stages that pass errors along. Built on an existing detection architecture, the approach jointly learns geometric and structural cues through one query-based decoder that classifies elements, regresses boxes, produces masks, and infers reading order relationships. A sympathetic reader would care because this setup is claimed to deliver greater robustness to common real-world distortions such as warping and perspective shifts while running at high speed. If the claim holds, document parsing systems could use a lighter, faster front-end that improves downstream reconstruction quality when paired with OCR.

Core claim

The central claim is that a unified multi-task formulation inside a single query-based decoder, derived from RT-DETR, simultaneously classifies layout elements, regresses their bounding boxes, generates pixel-level masks, and constructs relationships to determine reading order. This joint optimization of geometric and structural representations is presented as the mechanism that yields state-of-the-art accuracy on public benchmarks while sustaining real-time inference at 132.1 FPS and improving full-document reconstruction when the model is coupled with OCR engines.

What carries the argument

The unified multi-task formulation within a single query-based decoder that classifies, regresses bounding boxes, generates masks, and constructs relationships for reading order.

If this is right

State-of-the-art results on standard document layout analysis benchmarks.
Real-time inference speed of 132.1 FPS with a 33M-parameter model.
Reduced error propagation relative to multi-stage pipelines.
Higher quality full-document reconstruction when paired with downstream OCR engines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-decoder pattern could be tested on video or scene graphs where detection and ordering must be solved together.
Deployment on mobile hardware becomes feasible given the modest parameter count and speed.
Adding text recognition as an extra task head might further tighten the coupling between layout and content understanding.

Load-bearing premise

Joint multi-task optimization inside one decoder produces substantially better robustness to geometric distortions than prior multi-stage pipelines.

What would settle it

A side-by-side test on a controlled set of warped and perspective-distorted documents that measures layout and reading-order accuracy for the joint decoder versus an otherwise identical model with separate task heads.

read the original abstract

Accurate document layout analysis remains a critical bottleneck for document parsing systems, due to the intricate coupling among heterogeneous document layout elements, geometric distortions (\eg, paper warping and bending, perspective variations), and reading order within diverse layout structures. Existing approaches typically rely on fragmented multi-stage pipelines or computationally heavy generative Transformer architectures, leading to error propagation and limited efficiency. In this paper, we present RT-DocLayout, a highly efficient end-to-end framework for document layout analysis, designed as a front-end for document parsing tasks. The proposed model unifies classification, detection, pixel-level segmentation, and reading order prediction for layout elements within a single 33M-parameter architecture. Built upon the RT-DETR, our key contribution is a unified multi-task formulation within a single query-based decoder that simultaneously classifies, regresses bounding box, generates masks, and constructs relationship to reason reading order. By jointly learning geometric and structural representations, RT-DocLayout introduces multi-task optimization that substantially improves robustness under real-world document distortions. Extensive experiments on public benchmarks demonstrate state-of-the-art performance in document layout analysis while maintaining real-time inference speed(132.1 FPS). When coupled with downstream OCR engines, RT-DocLayout significantly improves full-document reconstruction quality, providing a scalable and practical foundation for real-world document intelligence systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RT-DocLayout puts reading-order prediction into a single RT-DETR decoder for layout analysis, but the abstract gives no ablations or distortion-specific numbers to back the robustness claim.

read the letter

The paper's main move is to fold classification, box regression, mask prediction, and reading-order relations into one query-based decoder on top of RT-DETR. That produces a 33M-parameter model that runs at 132 FPS and is meant to serve as a front-end for OCR pipelines.

The end-to-end design is a reasonable response to the error propagation that comes with separate detection-then-order stages. Treating reading order as an explicit relation inside the same decoder is a concrete addition that prior DETR-style layout work did not include.

The claim that joint geometric and structural training yields substantially better tolerance to warping and perspective shifts is stated directly, yet the abstract supplies no ablation that isolates the multi-task decoder, no metrics on distorted subsets, and no error-propagation comparison against staged baselines. Without those, the causal link between the unified formulation and the robustness gain stays unshown.

The work is aimed at people who need a fast, integrated layout front-end rather than at theorists. If the full paper contains the missing ablations and the reported SOTA numbers hold under scrutiny, it is worth a referee's time; the practical framing and the speed figure are enough to justify review even if the robustness story needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces RT-DocLayout, a 33M-parameter end-to-end model extending RT-DETR with a unified query-based decoder that jointly performs classification, bounding-box regression, pixel-level mask generation, and reading-order relation prediction for document layout elements. It claims that this single multi-task formulation substantially improves robustness to geometric distortions (warping, perspective) compared to multi-stage pipelines, delivers state-of-the-art results on public benchmarks, and sustains real-time inference at 132.1 FPS, thereby improving downstream full-document reconstruction when paired with OCR engines.

Significance. If the performance and robustness claims are substantiated, the work would offer a practical, efficient front-end for document parsing pipelines by reducing error propagation and enabling real-time operation on distorted inputs, which is valuable for real-world document intelligence systems.

major comments (2)

[Abstract] Abstract: the central claim that 'multi-task optimization that substantially improves robustness under real-world document distortions' is presented without any referenced ablation (e.g., joint vs. separate heads), distortion-specific metrics (mAP/F1 on warped/perspective subsets), or error-propagation analysis comparing the unified decoder to prior multi-stage pipelines; this directly underpins the key contribution.
[Abstract] Abstract: the assertion of 'state-of-the-art performance' and 'extensive experiments on public benchmarks' is unsupported by any numerical results, baseline comparisons, dataset names, or tables in the provided text, preventing evaluation of the magnitude or statistical significance of the reported gains.

minor comments (1)

[Abstract] Typo: 'speed(132.1 FPS)' should read 'speed (132.1 FPS)'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the abstract. The full manuscript contains the requested ablations, metrics, and tables; we will revise the abstract to reference them explicitly so the claims are better supported.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'multi-task optimization that substantially improves robustness under real-world document distortions' is presented without any referenced ablation (e.g., joint vs. separate heads), distortion-specific metrics (mAP/F1 on warped/perspective subsets), or error-propagation analysis comparing the unified decoder to prior multi-stage pipelines; this directly underpins the key contribution.

Authors: The manuscript includes these analyses (ablations on joint vs. separate heads, distortion-specific mAP on warped/perspective subsets, and error-propagation comparisons) in the experiments section. We will revise the abstract to add a brief reference to these results and key metrics, making the robustness claim directly traceable. revision: yes
Referee: [Abstract] Abstract: the assertion of 'state-of-the-art performance' and 'extensive experiments on public benchmarks' is unsupported by any numerical results, baseline comparisons, dataset names, or tables in the provided text, preventing evaluation of the magnitude or statistical significance of the reported gains.

Authors: The full paper reports numerical results, baselines, and dataset names (e.g., PubLayNet, DocBank) in tables. We will update the abstract to include representative SOTA numbers and benchmark names so the performance claims are concrete and evaluable from the abstract alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on benchmark experiments, not self-referential derivations

full rationale

The paper presents RT-DocLayout as an architectural unification of tasks inside a query-based decoder built on RT-DETR, with performance claims (SOTA accuracy at 132.1 FPS, improved robustness) explicitly tied to 'extensive experiments on public benchmarks.' No equations, fitted parameters, or uniqueness theorems are introduced that reduce by construction to the inputs; the multi-task optimization benefit is asserted as an observed outcome rather than a definitional or self-cited necessity. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or invented entities are described beyond the stated model size and the assumption that multi-task learning improves robustness.

pith-pipeline@v0.9.1-grok · 5811 in / 1228 out tokens · 32474 ms · 2026-06-26T09:17:46.090445+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 1 canonical work pages

[1]

Rope: Reading order equivariant positional encoding for graph- based document information extraction, 2021

Chen-Yu Lee, Chun-Liang Li, Chu Wang, Renshen Wang, Yasuhisa Fujii, Siyang Qin, Ashok Popat, and Tomas Pfister. Rope: Reading order equivariant positional encoding for graph- based document information extraction, 2021. URL https://arxiv.org/abs/2106.1 0786

2021
[2]

Recursive xy cut using bounding boxes of connected components

Jaekyu Ha, Robert M Haralick, and Ihsin T Phillips. Recursive xy cut using bounding boxes of connected components. InProceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pages 952–955. IEEE, 1995

1995
[3]

Layoutlmv3: Pre-training for document ai with unified text and image masking, 2022

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking, 2022. URL https://arxiv.org/ abs/2204.08387

arXiv 2022
[4]

Dit: Self-supervised pre-training for document image transformer, 2022

Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. Dit: Self-supervised pre-training for document image transformer, 2022. URL https://arxiv.org/abs/22 03.02378

2022
[5]

Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, ...

Pith/arXiv arXiv 2025
[6]

Dolphin: Document image parsing via heterogeneous anchor prompting, 2025

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, and Can Huang. Dolphin: Document image parsing via heterogeneous anchor prompting, 2025. URL https://arxiv.org/abs/25 05.14059

2025
[7]

Detrs beat yolos on real-time object detection, 2024

Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. Detrs beat yolos on real-time object detection, 2024. URL https: //arxiv.org/abs/2304.08069

arXiv 2024
[8]

Faster r-cnn: Towards real-time object detection with region proposal networks, 2016

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2016. URL https://arxiv.org/abs/ 1506.01497

Pith/arXiv arXiv 2016
[9]

Mask r-cnn, 2018

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn, 2018. URL https://arxiv.org/abs/1703.06870

Pith/arXiv arXiv 2018
[10]

Publaynet: largest dataset ever for document layout analysis, 2019

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: largest dataset ever for document layout analysis, 2019. URLhttps://arxiv.org/abs/1908.07836

arXiv 2019
[11]

Layoutparser: A unified toolkit for deep learning based document image analysis, 2021

Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. Layoutparser: A unified toolkit for deep learning based document image analysis, 2021. URLhttps://arxiv.org/abs/2103.15348

arXiv 2021
[12]

Layoutlm: Pre-training of text and layout for document image understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. Layoutlm: Pre-training of text and layout for document image understanding. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 1192–1200. ACM, August 2020. doi: 10.1145/3394486.3403172. URL http://dx.doi.org/10.1145/33...

work page doi:10.1145/3394486.3403172 2020
[13]

Layoutlmv2: Multi- modal pre-training for visually-rich document understanding, 2022

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. Layoutlmv2: Multi- modal pre-training for visually-rich document understanding, 2022. URL https://arxi v.org/abs/2012.14740

arXiv 2022
[14]

Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception, 2024

Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception, 2024. URLhttps://arxiv.org/abs/2410.12628

arXiv 2024
[15]

Layoutreader: Pre-training of text and layout for reading order detection, 2021

Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. Layoutreader: Pre-training of text and layout for reading order detection, 2021. URL https://arxiv.org/abs/21 08.11591

2021
[16]

Xylayoutlm: Towards layout-aware multimodal networks for visually-rich document understanding, 2022

Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, and Liqing Zhang. Xylayoutlm: Towards layout-aware multimodal networks for visually-rich document understanding, 2022. URLhttps://arxiv.org/abs/2203.06947. 14

arXiv 2022
[17]

Ocr-free document understanding transformer, 2022

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer, 2022. URL https://arxiv.org/abs/2111.156 64

2022
[18]

Nougat: Neural optical understanding for academic documents, 2023

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents, 2023. URL https://arxiv.org/abs/ 2308.13418

Pith/arXiv arXiv 2023
[19]

Unifying vision, text, and layout for universal document processing, 2023

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing, 2023. URLhttps://arxiv.org/abs/2212.02623

arXiv 2023
[20]

Mineru2.0-2505-0.9b

opendatalab. Mineru2.0-2505-0.9b. https://huggingface.co/opendatalab/Miner U2.0-2505-0.9B, 2025

2025
[21]

dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025. URLhttps://arxiv. org/abs/2512.02498

arXiv 2025
[22]

Dlaformer: An end-to-end transformer for document layout analysis, 2024

Jiawei Wang, Kai Hu, and Qiang Huo. Dlaformer: An end-to-end transformer for document layout analysis, 2024. URLhttps://arxiv.org/abs/2405.11757

arXiv 2024
[23]

Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model, 2025

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model, 2025. URL https://arxiv.org/abs/2510.14528

arXiv 2025
[24]

Ni, and Heung-Yeung Shum

Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022. URLhttps://arxiv.org/abs/2206.02777

arXiv 2022
[25]

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations, 2025

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxi- ang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations, 2025. URL https://arxiv.org/abs...

arXiv 2025
[26]

Real5-omnidocbench: A full-scale physical reconstruction benchmark for robust document parsing in the wild, 2026

Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, and Yi Liu. Real5-omnidocbench: A full-scale physical reconstruction benchmark for robust document parsing in the wild, 2026. URLhttps://arxiv.org/abs/2603.04205

Pith/arXiv arXiv 2026
[27]

Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm, 2026

Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm, 2026. URL https://arxiv.org/ abs/2506.05218

arXiv 2026
[28]

Dolphin-v2: Universal document parsing via scalable anchor prompting, 2026

Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, and Can Huang. Dolphin-v2: Universal document parsing via scalable anchor prompting, 2026. URL https://arxiv.org/abs/2602.053 84. 15

2026
[29]

Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026. URLhttps://arxiv.org/abs/2601.21957. 16

Pith/arXiv arXiv 2026

[1] [1]

Rope: Reading order equivariant positional encoding for graph- based document information extraction, 2021

Chen-Yu Lee, Chun-Liang Li, Chu Wang, Renshen Wang, Yasuhisa Fujii, Siyang Qin, Ashok Popat, and Tomas Pfister. Rope: Reading order equivariant positional encoding for graph- based document information extraction, 2021. URL https://arxiv.org/abs/2106.1 0786

2021

[2] [2]

Recursive xy cut using bounding boxes of connected components

Jaekyu Ha, Robert M Haralick, and Ihsin T Phillips. Recursive xy cut using bounding boxes of connected components. InProceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pages 952–955. IEEE, 1995

1995

[3] [3]

Layoutlmv3: Pre-training for document ai with unified text and image masking, 2022

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking, 2022. URL https://arxiv.org/ abs/2204.08387

arXiv 2022

[4] [4]

Dit: Self-supervised pre-training for document image transformer, 2022

Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. Dit: Self-supervised pre-training for document image transformer, 2022. URL https://arxiv.org/abs/22 03.02378

2022

[5] [5]

Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, ...

Pith/arXiv arXiv 2025

[6] [6]

Dolphin: Document image parsing via heterogeneous anchor prompting, 2025

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, and Can Huang. Dolphin: Document image parsing via heterogeneous anchor prompting, 2025. URL https://arxiv.org/abs/25 05.14059

2025

[7] [7]

Detrs beat yolos on real-time object detection, 2024

Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. Detrs beat yolos on real-time object detection, 2024. URL https: //arxiv.org/abs/2304.08069

arXiv 2024

[8] [8]

Faster r-cnn: Towards real-time object detection with region proposal networks, 2016

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2016. URL https://arxiv.org/abs/ 1506.01497

Pith/arXiv arXiv 2016

[9] [9]

Mask r-cnn, 2018

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn, 2018. URL https://arxiv.org/abs/1703.06870

Pith/arXiv arXiv 2018

[10] [10]

Publaynet: largest dataset ever for document layout analysis, 2019

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: largest dataset ever for document layout analysis, 2019. URLhttps://arxiv.org/abs/1908.07836

arXiv 2019

[11] [11]

Layoutparser: A unified toolkit for deep learning based document image analysis, 2021

Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. Layoutparser: A unified toolkit for deep learning based document image analysis, 2021. URLhttps://arxiv.org/abs/2103.15348

arXiv 2021

[12] [12]

Layoutlm: Pre-training of text and layout for document image understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. Layoutlm: Pre-training of text and layout for document image understanding. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 1192–1200. ACM, August 2020. doi: 10.1145/3394486.3403172. URL http://dx.doi.org/10.1145/33...

work page doi:10.1145/3394486.3403172 2020

[13] [13]

Layoutlmv2: Multi- modal pre-training for visually-rich document understanding, 2022

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. Layoutlmv2: Multi- modal pre-training for visually-rich document understanding, 2022. URL https://arxi v.org/abs/2012.14740

arXiv 2022

[14] [14]

Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception, 2024

Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception, 2024. URLhttps://arxiv.org/abs/2410.12628

arXiv 2024

[15] [15]

Layoutreader: Pre-training of text and layout for reading order detection, 2021

Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. Layoutreader: Pre-training of text and layout for reading order detection, 2021. URL https://arxiv.org/abs/21 08.11591

2021

[16] [16]

Xylayoutlm: Towards layout-aware multimodal networks for visually-rich document understanding, 2022

Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, and Liqing Zhang. Xylayoutlm: Towards layout-aware multimodal networks for visually-rich document understanding, 2022. URLhttps://arxiv.org/abs/2203.06947. 14

arXiv 2022

[17] [17]

Ocr-free document understanding transformer, 2022

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer, 2022. URL https://arxiv.org/abs/2111.156 64

2022

[18] [18]

Nougat: Neural optical understanding for academic documents, 2023

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents, 2023. URL https://arxiv.org/abs/ 2308.13418

Pith/arXiv arXiv 2023

[19] [19]

Unifying vision, text, and layout for universal document processing, 2023

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing, 2023. URLhttps://arxiv.org/abs/2212.02623

arXiv 2023

[20] [20]

Mineru2.0-2505-0.9b

opendatalab. Mineru2.0-2505-0.9b. https://huggingface.co/opendatalab/Miner U2.0-2505-0.9B, 2025

2025

[21] [21]

dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025. URLhttps://arxiv. org/abs/2512.02498

arXiv 2025

[22] [22]

Dlaformer: An end-to-end transformer for document layout analysis, 2024

Jiawei Wang, Kai Hu, and Qiang Huo. Dlaformer: An end-to-end transformer for document layout analysis, 2024. URLhttps://arxiv.org/abs/2405.11757

arXiv 2024

[23] [23]

Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model, 2025

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model, 2025. URL https://arxiv.org/abs/2510.14528

arXiv 2025

[24] [24]

Ni, and Heung-Yeung Shum

Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022. URLhttps://arxiv.org/abs/2206.02777

arXiv 2022

[25] [25]

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations, 2025

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxi- ang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations, 2025. URL https://arxiv.org/abs...

arXiv 2025

[26] [26]

Real5-omnidocbench: A full-scale physical reconstruction benchmark for robust document parsing in the wild, 2026

Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, and Yi Liu. Real5-omnidocbench: A full-scale physical reconstruction benchmark for robust document parsing in the wild, 2026. URLhttps://arxiv.org/abs/2603.04205

Pith/arXiv arXiv 2026

[27] [27]

Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm, 2026

Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm, 2026. URL https://arxiv.org/ abs/2506.05218

arXiv 2026

[28] [28]

Dolphin-v2: Universal document parsing via scalable anchor prompting, 2026

Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, and Can Huang. Dolphin-v2: Universal document parsing via scalable anchor prompting, 2026. URL https://arxiv.org/abs/2602.053 84. 15

2026

[29] [29]

Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026. URLhttps://arxiv.org/abs/2601.21957. 16

Pith/arXiv arXiv 2026