Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Bin Wang; Conghui He; Hao Liang; Junyuan Zhang; Qintong Zhang; Victor Shea-Jay Huang; Wentao Zhang; Zhengren Wang

arXiv: 2410.21169 · v5 · submitted 2024-10-28 · cs.MM · cs.AI · cs.CL · cs.CV

Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang

Pith reviewed 2026-05-23 19:13 UTC · model grok-4.3

classification: cs.MM · cs.AI · cs.CL · cs.CV

keywords: document parsing · structured information extraction · vision-language models · layout analysis · table recognition · mathematical expressions · evaluation benchmarks · VLM-based parsing

The pith

A survey organizes document parsing methods into modular pipeline systems and unified VLM-driven models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Document parsing converts unstructured or semi-structured documents into structured machine-readable formats that support tasks such as knowledge base construction and retrieval-augmented generation. The paper reviews existing work and introduces a taxonomy that separates approaches into two groups: pipeline-based systems, which handle layout analysis and the recognition of text, tables, math, and visuals in stages, and unified models built on vision-language models (VLMs). It details the components of pipeline systems, traces how specialized VLMs have developed for parsing, and compiles standard evaluation metrics along with key benchmarks. The survey closes by identifying open issues in robustness, VLM reliability, and inference speed while pointing to paths for more accurate systems. A reader would value the organization because it clarifies how different techniques relate and what gaps remain for practical document intelligence.

Core claim

The paper establishes a systematic taxonomy with two classes of document parsing approaches: modular pipeline-based systems, which decompose parsing into stages such as layout analysis and the recognition of heterogeneous content (text, tables, mathematical expressions, and visual elements), and unified models driven by Vision-Language Models (VLMs). It also reviews the evolution of those VLMs, widely used evaluation metrics, high-quality benchmarks, and remaining challenges in robustness, reliability, and efficiency.

What carries the argument

The taxonomy that divides existing approaches into modular pipeline-based systems and unified VLM-driven models
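
As a rough illustration of the two branches of this taxonomy, the Python sketch below expresses them as two implementations of one `parse` interface. Every name in it (`Region`, `PipelineParser`, `UnifiedParser`, and the toy layout and recognizer functions) is hypothetical and chosen for illustration; none comes from the paper or from any parsing library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Tuple


@dataclass
class Region:
    """One layout element found on a page (hypothetical structure)."""
    kind: str                                # e.g. "text", "table", "formula"
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    order: int                               # reading-order index


class DocumentParser(Protocol):
    """Shared interface: page bytes in, structured markup out."""
    def parse(self, page: bytes) -> str: ...


class PipelineParser:
    """Modular route: layout analysis first, then one recognizer per region kind."""

    def __init__(
        self,
        layout: Callable[[bytes], List[Region]],
        recognizers: Dict[str, Callable[[bytes, Region], str]],
    ) -> None:
        self.layout = layout
        self.recognizers = recognizers

    def parse(self, page: bytes) -> str:
        # Sort detected regions by reading order, then dispatch on region kind.
        regions = sorted(self.layout(page), key=lambda r: r.order)
        return "\n\n".join(self.recognizers[r.kind](page, r) for r in regions)


class UnifiedParser:
    """End-to-end route: a single model maps the whole page to markup."""

    def __init__(self, model: Callable[[bytes], str]) -> None:
        self.model = model

    def parse(self, page: bytes) -> str:
        return self.model(page)


# Toy stand-ins so the sketch runs; real systems would call trained models here.
def toy_layout(page: bytes) -> List[Region]:
    return [
        Region("table", (0, 50, 100, 90), order=1),
        Region("text", (0, 0, 100, 40), order=0),
    ]


toy_recognizers = {
    "text": lambda page, region: "Hello world",
    "table": lambda page, region: "| a | b |",
}

pipeline = PipelineParser(toy_layout, toy_recognizers)
unified = UnifiedParser(lambda page: "Hello world\n\n| a | b |")

for parser in (pipeline, unified):
    print(type(parser).__name__, "->", repr(parser.parse(b"")))
```

In this framing, a pipeline stage can be swapped by replacing one entry in `toy_recognizers`, whereas the unified route exposes only the single `model` callable.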

If this is right

Pipeline systems support targeted improvements in individual stages such as layout analysis and content recognition.
Unified VLM models enable end-to-end parsing that handles complex document structures without separate modules.
Standardized benchmarks and metrics allow consistent comparison of parsing quality across methods.
Resolving challenges in robustness to complex layouts and VLM reliability will support more scalable document intelligence systems.
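
To make the point about consistent comparison concrete, here is a minimal sketch of one text-level score often used for OCR-style output: edit distance normalized by the longer string's length. This page does not list the survey's metrics, so the choice of metric and the function names `levenshtein` and `normalized_edit_distance` are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))


print(normalized_edit_distance("Docurnent parsing", "Document parsing"))
```

Because the score is bounded, outputs from a pipeline system and from a unified model can be compared against the same reference text on the same scale.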

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The taxonomy may help researchers decide whether to refine separate modules or invest in larger unified models for specific applications.
Widespread adoption of the VLM route could reduce reliance on hand-crafted pipeline stages over time.
Improved parsing efficiency would directly benefit downstream systems that ingest large document collections.
Future work could test whether hybrid approaches that combine pipeline modularity with VLM capabilities outperform either category alone.

Load-bearing premise

The body of literature selected for review is representative of the field and the proposed taxonomy captures the primary distinctions between approaches without major omissions or overlaps.

What would settle it

Discovery of a substantial body of document parsing methods that cannot be placed into either the pipeline category or the unified VLM category without forcing significant overlap or omission.

Figures

Figures reproduced from arXiv: 2410.21169 by Bin Wang, Conghui He, Hao Liang, Junyuan Zhang, Qintong Zhang, Victor Shea-Jay Huang, Wentao Zhang, Zhengren Wang.

**Figure 2.** Two Methodology of Document Parsing.

**Figure 3.** Overview of the Document Layout Analysis.

**Figure 4.** Overview of the Optical Character Recognition.

**Figure 5.** Overview of the Mathematical Expression Detection and Recognition.

**Figure 6.** Overview of the Table Detection and Recognition.

**Figure 7.** Overview of the Chart-related Tasks in Document.

Original abstract

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG). This survey provides a comprehensive and timely review of document parsing research. We propose a systematic taxonomy that organizes existing approaches into modular pipeline-based systems and unified models driven by Vision-Language Models (VLMs). We provide a detailed review of key components in pipeline systems, including layout analysis and the recognition of heterogeneous content such as text, tables, mathematical expressions, and visual elements, and then systematically track the evolution of specialized VLMs for document parsing. Additionally, we summarize widely adopted evaluation metrics and high-quality benchmarks that establish current standards for parsing quality. Finally, we discuss key open challenges, including robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency, and outline directions for building more accurate and scalable document intelligence systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This survey organizes document parsing work into pipeline systems versus VLM models, reviews components and benchmarks, but adds no new methods or findings. It pulls together layout analysis, recognition of text/tables/math/visuals, the shift to specialized VLMs, common metrics, and standard benchmarks in one place. That kind of compilation can save time for someone who needs to understand current options for turning documents into structured data for RAG or knowledge bases. The breakdown of pipeline components is straightforward and practical. The section on VLM evolution tracks how these models have been applied to parsing tasks without forcing unrelated ideas together. Summarizing evaluation standards is the most immediately usable part for practitioners. The taxonomy itself follows the usual split between modular and end-to-end approaches, so it does not create unexpected connections or reduce the problem in a new way. As with most surveys, the real test is whether the selected papers are representative and whether any cited work is described inaccurately, but the abstract gives no sign of major internal contradictions. The listed challenges (robustness, VLM reliability, efficiency) are already widely discussed, and the paper does not appear to offer fresh resolutions or measurements. This paper is mainly for readers entering the area or needing a reference list and overview rather than for specialists already working on new parsing techniques. It deserves peer review because a clear, balanced survey can still be a useful community resource even without original experiments or theory.

Referee Report

1 major / 3 minor

Summary. This survey reviews document parsing techniques that convert unstructured or semi-structured documents into machine-readable structured representations for downstream tasks such as knowledge base construction and retrieval-augmented generation. It proposes a taxonomy dividing existing methods into modular pipeline-based systems and unified models based on Vision-Language Models (VLMs). The paper reviews pipeline components including layout analysis and recognition of text, tables, mathematical expressions, and visual elements; tracks the development of specialized VLMs for document parsing; summarizes evaluation metrics and benchmarks; and discusses challenges such as robustness to complex layouts, VLM reliability, and inference efficiency, along with future research directions.

Significance. If the taxonomy is well-justified and the literature coverage representative, the survey offers a timely organizational framework for a fast-moving area, particularly the transition from pipelines to VLM-centric approaches. It can serve as a reference point for identifying gaps in robustness and scalability, aiding researchers working on document intelligence systems.

major comments (1)

[Abstract] The central claim that the work provides a 'comprehensive' review and 'systematic taxonomy' is load-bearing for the paper's contribution, yet the abstract provides no details on literature search methodology, inclusion criteria, or time period covered; this makes it impossible to evaluate whether the taxonomy omits major lines of work or contains unacknowledged overlaps between the two categories.

minor comments (3)

[Taxonomy] The taxonomy description would be clearer with an explicit figure or table contrasting the two categories (modular pipelines vs. unified VLMs) and their sub-components.
[Challenges] The challenges section would benefit from citing specific quantitative results (e.g., error rates or failure modes on named benchmarks) to ground the discussion of robustness and reliability limitations.
[Evaluation Metrics and Benchmarks] Ensure all cited benchmarks and metrics are accompanied by references to their original papers to allow readers to trace the evaluation standards.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and constructive feedback on our survey. We address the single major comment below.

Referee: [Abstract] The central claim that the work provides a 'comprehensive' review and 'systematic taxonomy' is load-bearing for the paper's contribution, yet the abstract provides no details on literature search methodology, inclusion criteria, or time period covered; this makes it impossible to evaluate whether the taxonomy omits major lines of work or contains unacknowledged overlaps between the two categories.

Authors: We agree that the abstract should provide greater transparency regarding the literature review process to support the claims of comprehensiveness and systematic organization. The taxonomy in the manuscript was constructed by surveying peer-reviewed works primarily from 2018 onward in venues such as CVPR, ICCV, NeurIPS, ACL, and related journals, with inclusion focused on methods addressing layout analysis, content recognition, and VLM-based parsing; overlaps between pipeline and VLM categories are explicitly discussed in Section 3. To address the referee's point, we will revise the abstract to include a concise statement on the search methodology, inclusion criteria, and covered time period.

revision: yes

Circularity Check

0 steps flagged

Survey paper with no derivations or predictions exhibits no circularity

This is a literature survey proposing a descriptive taxonomy that organizes prior document parsing work into modular pipeline-based systems versus unified VLM-driven models. It reviews components, metrics, benchmarks, and challenges without any equations, quantitative predictions, fitted parameters, or novel derivations. The central claim is an organizational framework for existing literature rather than a result derived from self-referential assumptions or self-citations; the representativeness of selected papers is a standard survey concern external to any internal derivation chain. No load-bearing steps reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a literature survey the central contribution rests on selection and synthesis of prior publications rather than new axioms, free parameters, or invented entities.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A document is worth a structured record: Principled inductive bias design for document recognition
cs.CV 2025-07 unverdicted novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
cs.CL 2026-05 unverdicted novelty 6.0

CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
cs.CV 2025-09 unverdicted novelty 6.0

MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
cs.AI 2026-05 conditional novelty 4.0

MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.

RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering
cs.IR 2026-03 unverdicted novelty 4.0

RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 6 Pith papers · 15 internal anchors

[1]

Abdelrahman Abdallah, Daniel Eberharter, Zoe Pfister, and Adam Jatowt. 2024. Transformers and language models in form understanding: A comprehensive review of scanned document analysis. arXiv preprint arXiv:2403.04080 (2024)

work page arXiv 2024
[2]

Ridhi Aggarwal, Shilpa Pandey, Anil Kumar Tiwari, and Gaurav Harit. 2022. Survey of mathematical expression recognition for printed and handwritten documents. IETE Technical Review 39, 6 (2022), 1245–1253

work page 2022
[3]

Md Mutasim Billah Abu Noman Akanda, Maruf Ahmed, AKM Shahariar Azad Rabby, and Fuad Rahman. 2024. Optimum Deep Learning Method for Document Layout Analysis in Low Resource Languages. In Proceedings of the 2024 ACM Southeast Conference . 199–204

work page 2024
[4]

Rabah Al-Zaidy and C Giles. 2017. A machine learning approach for semantic structuring of scientific charts in scholarly documents. InProceedings of the AAAI Conference on Artificial Intelligence , Vol. 31. 4644–4649

work page 2017
[5]

Robert H Anderson. 1967. Syntax-directed recognition of hand-printed two-dimensional mathematics. In Symposium on interactive systems for experimental applied mathematics: Proceedings of the Association for Computing Machinery Inc. Symposium . 436–459

work page 1967
[6]

Dan Anitei, Joan Andreu Sánchez, José Manuel Fuentes, Roberto Paredes, and José Miguel Benedí. 2021. ICDAR 2021 competition on mathematical formula detection. In International Conference on Document Analysis and Recognition . Springer, 783–795

work page 2021
[7]

Apostolos Antonacopoulos, David Bridson, Christos Papadopoulos, and Stefan Pletschacher. 2009. A realistic dataset for performance evaluation of document layout analysis. In 2009 10th International Conference on Document Analysis and Recognition . IEEE, 296–300

work page 2009
[8]

Emilia Apostolova, Daekeun You, Zhiyun Xue, Sameer Antani, Dina Demner-Fushman, and George R Thoma. 2013. Image retrieval from scientific publications: Text and image content processing to separate multipanel figures. Journal of the American Society for Information Science and Technology 64, 5 (2013), 893–908

work page 2013
[9]

Tiago Araújo, Paulo Chagas, Joao Alves, Carlos Santos, Beatriz Sousa Santos, and Bianchi Serique Meiguins. 2020. A real-world approach on the problem of chart recognition using classification, detection and perspective correction. Sensors 20, 16 (2020), 4370

work page 2020
[10]

Sercan Ö Arik and Tomas Pfister. 2021. Tabnet: Attentive interpretable tabular learning. InProceedings of the AAAI conference on artificial intelligence, Vol. 35. 6679–6687

work page 2021
[11]

Rowel Atienza. 2021. Vision transformer for fast and efficient scene text recognition. InInternational conference on document analysis and recognition. Springer, 319–334

work page 2021
[12]

Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. Character region awareness for text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition . 9365–9374

work page 2019
[13]

Youngmin Baek, Seung Shin, Jeonghun Baek, Sungrae Park, Junyeop Lee, Daehyun Nam, and Hwalsuk Lee. 2020. Character region attention for text spotting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16 . Springer, 504–521

work page 2020
[14]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. (2023)

work page 2023
[15]

Ayan Banerjee, Sanket Biswas, Josep Lladós, and Umapada Pal. 2024. SemiDocSeg: harnessing semi-supervised learning for document layout analysis. International Journal on Document Analysis and Recognition (IJDAR) (2024), 1–18

work page 2024
[16]

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Darwin Bautista and Rowel Atienza. 2022. Scene text recognition with permuted autoregressive sequence models. In European conference on computer vision. Springer, 178–196

work page 2022
[18]

Dipali Baviskar, Swati Ahirrao, Vidyasagar Potdar, and Ketan Kotecha. 2021. Efficient automated processing of the unstructured documents using artificial intelligence: A systematic literature review and future directions. IEEE Access 9 (2021), 72894–72936

work page 2021
[19]

Galal M Binmakhashen and Sabri A Mahmoud. 2019. Document layout analysis: a comprehensive survey. ACM Computing Surveys (CSUR) 52, 6 (2019), 1–36

work page 2019
[20]

Lukas Blecher. 2022. pix2tex - LaTeX OCR. https://github.com/lukas-blecher/LaTeX-OCR. Accessed: 2024-2-29

work page 2022
[21]

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2023. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Michal Busta, Lukas Neumann, and Jiri Matas. 2017. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of the IEEE international conference on computer vision . 2204–2212

work page 2017
[23]

Céres Carton, Aurélie Lemaitre, and Bertrand Coüasnon. 2013. Fusion of statistical and structural information for flowchart recognition. In 2013 12th International Conference on Document Analysis and Recognition . IEEE, 1210–1214

work page 2013
[24]

Paulo Chagas, Rafael Akiyama, Aruanda Meiguins, Carlos Santos, Filipe Saraiva, Bianchi Meiguins, and Jefferson Morais. 2018. Evaluation of convolutional neural network architectures for chart image classification. In 2018 International Joint Conference on Neural Networks (IJCNN) . IEEE, 1–8

work page 2018
[25]

Chungkwong Chan. 2020. Stroke extraction for offline handwritten mathematical expression recognition. IEEE Access 8 (2020), 61565–61575

work page 2020
[26]

Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2024. OneChart: Purify the Chart Structural Extraction via One Auxiliary Token. arXiv preprint arXiv:2404.09987 (2024)

work page arXiv 2024
[27]

Jingye Chen, Bin Li, and Xiangyang Xue. 2021. Scene text telescope: Text-focused scene image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 12026–12035. Manuscript submitted to ACM 24 Zhang et al

work page 2021
[28]

Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, and Chunyuan Li. 2023. Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571 (2023)

work page arXiv 2023
[29]

Xinlei Chen, Ross Girshick, Kaiming He, and Piotr Dollár. 2019. Tensormask: A foundation for dense object segmentation. In Proceedings of the IEEE/CVF international conference on computer vision . 2061–2069

work page 2019
[30]

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24185–24198

work page 2024
[32]

Beibei Cheng, Sameer Antani, R Joe Stanley, and George R Thoma. 2011. Automatic segmentation of subfigure image panels for multimodal biomedical document retrieval. In Document Recognition and Retrieval XVIII , Vol. 7874. SPIE, 294–304

work page 2011
[33]

Hiuyi Cheng, Peirong Zhang, Sihang Wu, Jiaxin Zhang, Qiyuan Zhu, Zecheng Xie, Jing Li, Kai Ding, and Lianwen Jin. 2023. M6doc: A large-scale multi-format, multi-type, multi-layout, multi-language, multi-annotation category dataset for modern document layout analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 15...

work page 2023
[34]

Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. 2018. Aon: Towards arbitrarily-oriented text recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition . 5571–5579

work page 2018
[35]

Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. 2019. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 (2019)

work page arXiv 2019
[36]

Hugh A Chipman, Edward I George, Robert E McCulloch, and Thomas S Shively. 2022. mBART: multidimensional monotone BART. Bayesian Analysis 17, 2 (2022), 515–544

work page 2022
[37]

Chee Kheng Ch’ng and Chee Seng Chan. 2017. Total-text: A comprehensive dataset for scene text detection and recognition. In2017 14th IAPR international conference on document analysis and recognition (ICDAR) , Vol. 1. IEEE, 935–942

work page 2017
[38]

Sagnik Ray Choudhury, Shuting Wang, and C Lee Giles. 2016. Scalable algorithms for scholarly figure mining and semantics. In Proceedings of the International Workshop on Semantic Big Data . 1–6

work page 2016
[39]

Sagnik Ray Choudhury, Shuting Wang, Prasenjit Mitra, and C Lee Giles. 2015. Automated data extraction from scholarly line graphs. In Proc. Int. Workshop Graph. Recognit

work page 2015
[40]

Mathieu Cliche, David Rosenberg, Dhruv Madeka, and Connie Yee. 2017. Scatteract: Automated extraction of data from scatter plots. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part I 10. Springer, 135–150

work page 2017
[41]

Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. 2023. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF international conference on computer vision . 19462–19472

work page 2023
[42]

Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems 29 (2016)

work page 2016
[43]

Wenjing Dai, Meng Wang, Zhibin Niu, and Jiawan Zhang. 2018. Chart decoder: Generating textual and numeric information from chart images automatically. Journal of Visual Languages & Computing 48 (2018), 101–109

work page 2018
[44]

Kenny Davila, Bhargava Urala Kota, Srirangaraj Setlur, Venu Govindaraju, Christopher Tensmeyer, Sumit Shekhar, and Ritwick Chaudhry. 2019. ICDAR 2019 competition on harvesting raw tables from infographics (chart-infographics). In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1594–1599

work page 2019
[45]

Kenny Davila, Srirangaraj Setlur, David Doermann, Bhargava Urala Kota, and Venu Govindaraju. 2020. Chart mining: A survey of methods for automated chart analysis. IEEE transactions on pattern analysis and machine intelligence 43, 11 (2020), 3799–3819

work page 2020
[46]

Kenny Davila, Chris Tensmeyer, Sumit Shekhar, Hrituraj Singh, Srirangaraj Setlur, and Venu Govindaraju. 2021. ICPR 2020-competition on harvesting raw tables from infographics. In International Conference on Pattern Recognition . Springer, 361–380

work page 2021
[47]

Kenny Davila, Fei Xu, Saleem Ahmed, David A Mendoza, Srirangaraj Setlur, and Venu Govindaraju. 2022. Icpr 2022: Challenge on harvesting raw tables from infographics (chart-infographics). In 2022 26th International Conference on Pattern Recognition (ICPR) . IEEE, 4995–5001

work page 2022
[48]

Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. 2018. Pixellink: Detecting scene text via instance segmentation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

work page 2018
[49]

Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. 2017. generation with coarse-to-fine attention. In International Conference on Machine Learning. PMLR, 980–989

work page 2017
[50]

Yuntian Deng, David Rosenberg, and Gideon Mann. 2019. Challenges in end-to-end neural scientific table recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) . IEEE, 894–901

work page 2019
[51]

Timo I Denk and Christian Reisswig. 2019. Bertgrid: Contextualized embedding for 2d document representation and understanding. arXiv preprint arXiv:1909.04948 (2019)

work page arXiv 2019
[52]

Harsh Desai, Pratik Kayal, and Mayank Singh. 2021. TabLeX: a benchmark dataset for structure and content information extraction from scientific tables. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II 16. Springer, 554–569. Manuscript submitted to ACM Document...

work page 2021
[53]

Anurag Dhote, Mohammed Javed, and David S Doermann. 2023. A survey and approach to chart classification. In International Conference on Document Analysis and Recognition. Springer, 67–82

work page 2023
[54]

Anurag Dhote, Mohammed Javed, and David S Doermann. 2024. Swin-chart: An efficient approach for chart classification. Pattern Recognition Letters 185 (2024), 203–209

work page 2024
[55]

Daniel Drevon, Sophie R Fursa, and Allura L Malcolm. 2017. Intercoder reliability and validity of WebPlotDigitizer in extracting graphed data. Behavior modification 41, 2 (2017), 323–339

work page 2017
[56]

Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, and Yu-Gang Jiang. 2023. Context perception parallel decoder for scene text recognition. arXiv preprint arXiv:2307.12270 (2023)

work page arXiv 2023
[57]

Randa Elanwar, Wenda Qin, Margrit Betke, and Derry Wijaya. 2021. Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model. International Journal on Document Analysis and Recognition (IJDAR) 24, 4 (2021), 349–362

work page 2021
[58]

Jing Fang, Xin Tao, Zhi Tang, Ruiheng Qiu, and Ying Liu. 2012. Dataset, ground-truth and performance metrics for table detection evaluation. In 2012 10th IAPR International Workshop on Document Analysis Systems . IEEE, 445–449

work page 2012
[59]

Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, and Yongdong Zhang. 2021. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition . 7098–7107

work page 2021
[60]

Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. 2023. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. arXiv preprint arXiv:2311.11810 (2023)

work page arXiv 2023
[61]

Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. 2019. Textdragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE/CVF international conference on computer vision . 9076–9085

work page 2019
[62]

Jinglun Gao, Yin Zhou, and Kenneth E Barner. 2012. View: Visual information extraction widget for improving chart images accessibility. In 2012 19th IEEE international conference on image processing . IEEE, 2865–2868

work page 2012
[63]

Liangcai Gao, Yilun Huang, Hervé Déjean, Jean-Luc Meunier, Qinqin Yan, Yu Fang, Florian Kleber, and Eva Lang. 2019. ICDAR 2019 competition on table detection and recognition (cTDaR). In 2019 International Conference on Document Analysis and Recognition (ICDAR) . IEEE, 1510–1515

work page 2019
[64]

Liangcai Gao, Xiaohan Yi, Zhuoren Jiang, Leipeng Hao, and Zhi Tang. 2017. ICDAR2017 competition on page object detection. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , Vol. 1. IEEE, 1417–1422

work page 2017
[65]

Liangcai Gao, Xiaohan Yi, Yuan Liao, Zhuoren Jiang, Zuoyu Yan, and Zhi Tang. 2017. A deep learning-based formula detection method for PDF documents. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , Vol. 1. IEEE, 553–558

[66]

Azka Gilani, Shah Rukh Qasim, Imran Malik, and Faisal Shafait. 2017. Table detection using deep learning. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR), Vol. 1. IEEE, 771–776

[67]

Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. 2013. ICDAR 2013 table competition. In 2013 12th international conference on document analysis and recognition. IEEE, 1449–1453

[68]

Adel Got, Djaafar Zouache, Abdelouahab Moussaoui, Laith Abualigah, and Ahmed Alsayat. 2024. Improved manta ray foraging optimizer-based SVM for feature selection problems: a medical case study. Journal of Bionic Engineering 21, 1 (2024), 409–425

[69]

Tobias Grüning, Gundram Leifert, Tobias Strauß, Johannes Michael, and Roger Labahn. 2019. A two-stage method for text line detection in historical documents. International Journal on Document Analysis and Recognition (IJDAR) 22, 3 (2019), 285–302

[70]

Jiuxiang Gu, Jason Kuen, Vlad I Morariu, Handong Zhao, Rajiv Jain, Nikolaos Barmpalios, Ani Nenkova, and Tong Sun. 2021. Unidoc: Unified pretraining framework for document understanding. Advances in Neural Information Processing Systems 34 (2021), 39–50

[71]

Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, and Jingdong Wang. 2022. Trust: An accurate and end-to-end table structure recognizer using splitting-based transformers. arXiv preprint arXiv:2208.14687 (2022)

[72]

Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2315–2324

[73]

Jan Hajič and Pavel Pecina. 2017. The MUSCIMA++ dataset for handwritten optical music recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1. IEEE, 39–46

[74]

Mrinal Haloi, Shashank Shekhar, Nikhil Fande, Siddhant Swaroop Dash, et al. 2022. Table Detection in the Wild: A Novel Diverse Table Detection Dataset and Method. arXiv preprint arXiv:2209.09207 (2022)

[75]

Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. 2023. Chartllama: A multimodal llm for chart understanding and generation. arXiv preprint arXiv:2311.16483 (2023)

[76]

Leipeng Hao, Liangcai Gao, Xiaohan Yi, and Zhi Tang. 2016. A table detection method for pdf documents based on convolutional neural networks. In 2016 12th IAPR Workshop on Document Analysis Systems (DAS). IEEE, 287–292

[77]

Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. 2021. Cascade network with deformable composite backbone for formula detection in scanned document images. Applied Sciences 11, 16 (2021), 7610

[78]

Muhammad Yusuf Hassan, Mayank Singh, et al. 2023. Lineex: data extraction from scientific line charts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6213–6221

[79]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009

[80]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778

