Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
Pith reviewed 2026-05-23 19:13 UTC · model grok-4.3
The pith
A survey organizes document parsing methods into modular pipeline systems and unified VLM-driven models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a systematic taxonomy that classifies document parsing approaches into two categories: modular pipeline-based systems, which decompose parsing into tasks such as layout analysis and the recognition of heterogeneous content (text, tables, mathematical expressions, and visual elements), and unified models driven by Vision-Language Models (VLMs). It also reviews the evolution of those VLMs, widely used evaluation metrics, high-quality benchmarks, and remaining challenges in robustness, reliability, and efficiency.
What carries the argument
The taxonomy that divides existing approaches into modular pipeline-based systems and unified VLM-driven models.
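To make the two categories concrete, here is a minimal Python sketch of how they differ structurally. It is an illustration under stated assumptions, not code from the survey: the `Region` type, the toy layout stage, the toy recognizers, and the stand-in "unified model" are all hypothetical, and a real system would operate on page images rather than strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    kind: str     # "text", "table", or "formula" in this toy example
    y: int        # vertical position, used as a stand-in for reading order
    content: str  # stand-in for the pixels of the cropped region


Page = List[Region]


class PipelineParser:
    """Modular route: a layout stage followed by per-type recognizers."""

    def __init__(self, layout: Callable[[Page], Page],
                 recognizers: Dict[str, Callable[[Region], str]]) -> None:
        self.layout = layout
        self.recognizers = recognizers

    def parse(self, page: Page) -> str:
        regions = self.layout(page)  # stage 1: layout analysis
        parts = [self.recognizers[r.kind](r) for r in regions]  # stage 2
        return "\n".join(parts)


class UnifiedParser:
    """Unified route: one model maps the whole page to the output."""

    def __init__(self, model: Callable[[Page], str]) -> None:
        self.model = model

    def parse(self, page: Page) -> str:
        return self.model(page)


def toy_layout(page: Page) -> Page:
    # Hypothetical layout stage: order regions top to bottom.
    return sorted(page, key=lambda r: r.y)


# Hypothetical recognizers, one per content type.
toy_recognizers = {
    "text": lambda r: r.content,
    "formula": lambda r: f"${r.content}$",
    "table": lambda r: "| " + " | ".join(r.content.split(",")) + " |",
}


def toy_unified_model(page: Page) -> str:
    # Placeholder for a single VLM call; here it simply reuses the toy stages.
    return "\n".join(toy_recognizers[r.kind](r) for r in toy_layout(page))


page = [
    Region("formula", 20, "E = mc^2"),
    Region("text", 10, "Mass-energy equivalence:"),
    Region("table", 30, "m,c"),
]
pipeline_out = PipelineParser(toy_layout, toy_recognizers).parse(page)
unified_out = UnifiedParser(toy_unified_model).parse(page)
```

In the pipeline class each stage can be swapped or improved independently, while the unified class exposes a single opaque call. That is the structural distinction the taxonomy rests on.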
If this is right
- Pipeline systems support targeted improvements in individual stages such as layout analysis and content recognition.
- Unified VLM models enable end-to-end parsing that handles complex document structures without separate modules.
- Standardized benchmarks and metrics allow consistent comparison of parsing quality across methods.
- Resolving challenges in robustness to complex layouts and VLM reliability will support more scalable document intelligence systems.
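As an illustration of what a standardized metric can look like, the sketch below computes a normalized edit distance between a parser's text output and a reference transcription. The summary above does not name specific metrics, so the choice of this one is an assumption made for illustration; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))
```

Because the score is scaled by the longer string, outputs of different lengths become comparable, which is what allows one number to be reported across methods and documents.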
Where Pith is reading between the lines
- The taxonomy may help researchers decide whether to refine separate modules or invest in larger unified models for specific applications.
- Widespread adoption of the VLM route could reduce reliance on hand-crafted pipeline stages over time.
- Improved parsing efficiency would directly benefit downstream systems that ingest large document collections.
- Future work could test whether hybrid approaches that combine pipeline modularity with VLM capabilities outperform either category alone.
Load-bearing premise
The body of literature selected for review is representative of the field and the proposed taxonomy captures the primary distinctions between approaches without major omissions or overlaps.
What would settle it
Discovery of a substantial body of document parsing methods that cannot be placed into either the pipeline category or the unified VLM category without forcing significant overlap or omission.
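One way to picture this test is as a bookkeeping exercise over a labeled list of methods. The sketch below is purely hypothetical: the category names, the method names, and the labels are placeholders rather than data from the survey.

```python
from typing import Dict, List, Set

# Hypothetical names for the two categories of the taxonomy.
CATEGORIES = {"pipeline", "unified_vlm"}


def taxonomy_fit(labels: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Sort methods into clean fits, overlaps, and unplaced cases."""
    report: Dict[str, List[str]] = {"clean": [], "overlap": [], "unplaced": []}
    for method, cats in sorted(labels.items()):
        hits = cats & CATEGORIES
        if len(hits) == 1:
            report["clean"].append(method)
        elif len(hits) > 1:
            report["overlap"].append(method)
        else:
            report["unplaced"].append(method)
    return report


# Placeholder labels; real labels would come from reading each paper.
example = {
    "method_a": {"pipeline"},
    "method_b": {"unified_vlm"},
    "method_c": {"pipeline", "unified_vlm"},
    "method_d": set(),
}
report = taxonomy_fit(example)
```

A large `overlap` or `unplaced` list relative to `clean` would be the kind of evidence described above.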
Original abstract
Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG). This survey provides a comprehensive and timely review of document parsing research. We propose a systematic taxonomy that organizes existing approaches into modular pipeline-based systems and unified models driven by Vision-Language Models (VLMs). We provide a detailed review of key components in pipeline systems, including layout analysis and the recognition of heterogeneous content such as text, tables, mathematical expressions, and visual elements, and then systematically track the evolution of specialized VLMs for document parsing. Additionally, we summarize widely adopted evaluation metrics and high-quality benchmarks that establish current standards for parsing quality. Finally, we discuss key open challenges, including robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency, and outline directions for building more accurate and scalable document intelligence systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey reviews document parsing techniques that convert unstructured or semi-structured documents into machine-readable structured representations for downstream tasks such as knowledge base construction and retrieval-augmented generation. It proposes a taxonomy dividing existing methods into modular pipeline-based systems and unified models based on Vision-Language Models (VLMs). The paper reviews pipeline components including layout analysis and recognition of text, tables, mathematical expressions, and visual elements; tracks the development of specialized VLMs for document parsing; summarizes evaluation metrics and benchmarks; and discusses challenges such as robustness to complex layouts, VLM reliability, and inference efficiency, along with future research directions.
Significance. If the taxonomy is well-justified and the literature coverage representative, the survey offers a timely organizational framework for a fast-moving area, particularly the transition from pipelines to VLM-centric approaches. It can serve as a reference point for identifying gaps in robustness and scalability, aiding researchers working on document intelligence systems.
major comments (1)
- [Abstract] Abstract: The central claim that the work provides a 'comprehensive' review and 'systematic taxonomy' is load-bearing for the paper's contribution, yet the abstract provides no details on literature search methodology, inclusion criteria, or time period covered; this makes it impossible to evaluate whether the taxonomy omits major lines of work or contains unacknowledged overlaps between the two categories.
minor comments (3)
- [Taxonomy] The taxonomy description would be clearer with an explicit figure or table contrasting the two categories (modular pipelines vs. unified VLMs) and their sub-components.
- [Challenges] The challenges section would benefit from citing specific quantitative results (e.g., error rates or failure modes on named benchmarks) to ground the discussion of robustness and reliability limitations.
- [Evaluation Metrics and Benchmarks] Ensure all cited benchmarks and metrics are accompanied by references to their original papers to allow readers to trace the evaluation standards.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive feedback on our survey. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that the work provides a 'comprehensive' review and 'systematic taxonomy' is load-bearing for the paper's contribution, yet the abstract provides no details on literature search methodology, inclusion criteria, or time period covered; this makes it impossible to evaluate whether the taxonomy omits major lines of work or contains unacknowledged overlaps between the two categories.
  Authors: We agree that the abstract should provide greater transparency regarding the literature review process to support the claims of comprehensiveness and systematic organization. The taxonomy in the manuscript was constructed by surveying peer-reviewed works primarily from 2018 onward in venues such as CVPR, ICCV, NeurIPS, ACL, and related journals, with inclusion focused on methods addressing layout analysis, content recognition, and VLM-based parsing; overlaps between pipeline and VLM categories are explicitly discussed in Section 3. To address the referee's point, we will revise the abstract to include a concise statement on the search methodology, inclusion criteria, and covered time period.
  Revision: yes
Circularity Check
Survey paper with no derivations or predictions exhibits no circularity
This is a literature survey proposing a descriptive taxonomy that organizes prior document parsing work into modular pipeline-based systems versus unified VLM-driven models. It reviews components, metrics, benchmarks, and challenges without any equations, quantitative predictions, fitted parameters, or novel derivations. The central claim is an organizational framework for existing literature rather than a result derived from self-referential assumptions or self-citations; the representativeness of selected papers is a standard survey concern external to any internal derivation chain. No load-bearing steps reduce to the paper's own inputs by construction.
Forward citations
Cited by 6 Pith papers
- A document is worth a structured record: Principled inductive bias design for document recognition
  Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
  CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
  MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
- MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
  MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.
- RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering
  RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.