Lightweight and Production-Ready PDF Visual Element Parsing

Matthew Rowe; Meizhu Liu; Michael Avendi; Paul Li; Yassi Abbasi

arXiv: 2604.23276 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.AI · cs.CL

Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li

Pith reviewed 2026-05-08 08:28 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords PDF parsing · visual element detection · caption association · document understanding · multimodal RAG · layout analysis · semantic similarity · production deployment

The pith

A lightweight PDF parsing framework detects visual elements with at least 96 percent accuracy and associates captions at 93 percent accuracy using spatial heuristics, layout analysis, and semantic similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PDF documents contain figures, tables, and forms whose accurate extraction matters for document understanding and retrieval-augmented generation, yet many existing parsers miss elements, pull in artifacts, or fail to match captions reliably. This paper introduces a production-level system that locates visual elements through spatial position rules, organizes content via layout structure, and links captions by measuring semantic similarity. The approach records at least 96 percent detection accuracy and 93 percent caption association on standard benchmarks plus internal company documents. When inserted as a preprocessing stage, the parser improves multimodal RAG results over current tools and large vision-language models while cutting latency by more than half, and the system now runs in live production.

Core claim

The paper shows that combining spatial heuristics for bounding-box placement, layout analysis for structural grouping, and semantic similarity for caption matching produces a lightweight parser that extracts figures, tables, and forms from PDFs with at least 96 percent detection accuracy and 93 percent correct caption association, yielding stronger multimodal RAG performance and over 2 times lower latency than prior parsers or large vision-language models on both public benchmarks and internal data.

What carries the argument

Integration of spatial heuristics for element location, layout analysis for content organization, and semantic similarity scoring for caption-to-element pairing.
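
To make the pairing step concrete, here is a minimal sketch of one way a spatial score and a text-similarity score could be blended to pick a caption for each element. It is not the authors' implementation: the `Box` type, the `associate` function, the weights, and the `gap_scale` constant are all hypothetical, and a bag-of-words cosine stands in for whatever semantic similarity model the paper actually uses. The element text is assumed to be whatever text was extracted from inside the element's region.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates, with y growing downward."""
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_gap(element: Box, caption: Box) -> float:
    """Vertical distance between element and caption; 0 if they overlap vertically."""
    if caption.y0 >= element.y1:  # caption sits below the element
        return caption.y0 - element.y1
    if caption.y1 <= element.y0:  # caption sits above the element
        return element.y0 - caption.y1
    return 0.0


def horizontal_overlap(element: Box, caption: Box) -> float:
    """Fraction of the caption's width that falls inside the element's x-range."""
    shared = min(element.x1, caption.x1) - max(element.x0, caption.x0)
    width = caption.x1 - caption.x0
    return max(0.0, shared) / width if width > 0 else 0.0


def token_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for an embedding model)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def associate(elements, captions, w_spatial=0.6, w_semantic=0.4, gap_scale=50.0):
    """Map each element id to the id of its highest-scoring caption (or None)."""
    pairs = {}
    for eid, (ebox, etext) in elements.items():
        best_id, best_score = None, 0.0
        for cid, (cbox, ctext) in captions.items():
            # Spatial term: aligned columns and a small vertical gap score high.
            spatial = horizontal_overlap(ebox, cbox) * math.exp(-vertical_gap(ebox, cbox) / gap_scale)
            score = w_spatial * spatial + w_semantic * token_cosine(etext, ctext)
            if score > best_score:
                best_id, best_score = cid, score
        pairs[eid] = best_id
    return pairs


# Toy page: a figure with its caption below, a table with its caption above.
elements = {
    "fig": (Box(50, 100, 300, 300), "accuracy epoch"),
    "tab": (Box(50, 400, 300, 550), "model latency ms"),
}
captions = {
    "c1": (Box(50, 305, 300, 320), "Figure 1: Accuracy per epoch"),
    "c2": (Box(50, 380, 300, 395), "Table 1: Latency in ms per model"),
}
pairs = associate(elements, captions)
print(pairs)
```

In this sketch each element chooses independently, so two elements can claim the same caption; a one-to-one assignment step would be a natural refinement.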

If this is right

Multimodal RAG pipelines retrieve higher-quality visual content because elements arrive complete and correctly paired with captions.
Document processing pipelines for large collections run more than twice as fast.
Production systems avoid extracting non-informative artifacts such as watermarks or logos.
Downstream question-answering accuracy over PDF collections rises from cleaner input data.
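
The abstract does not say how artifacts such as watermarks or logos are filtered. As an illustration only, one common heuristic is to flag boxes that recur at nearly the same position on most pages. The function name `flag_repeated_boxes`, the grid size, and the thresholds below are assumptions chosen for the sketch, not values from the paper.

```python
from collections import defaultdict


def flag_repeated_boxes(pages, min_page_fraction=0.6, grid=4.0, min_pages=3):
    """Flag boxes whose quantized position recurs on many pages.

    pages: list of pages, each a list of (x0, y0, x1, y1) tuples.
    Returns a parallel list of lists of booleans (True = likely artifact).
    """
    if len(pages) < min_pages:  # too few pages to call anything "repeated"
        return [[False] * len(boxes) for boxes in pages]

    def key(box):
        # Snap coordinates to a coarse grid so small jitter maps to the same key.
        # Boxes that straddle a grid boundary can still split; this is a sketch.
        return tuple(round(v / grid) for v in box)

    pages_seen = defaultdict(set)
    for page_no, boxes in enumerate(pages):
        for box in boxes:
            pages_seen[key(box)].add(page_no)

    needed = min_page_fraction * len(pages)
    repeated = {k for k, seen in pages_seen.items() if len(seen) >= needed}
    return [[key(box) in repeated for box in boxes] for boxes in pages]


# Hypothetical three-page document: a corner logo on every page plus one figure each.
pages = [
    [(500, 20, 560, 52), (50, 100, 300, 300)],
    [(501, 21, 561, 53), (60, 400, 320, 600)],
    [(499, 20, 559, 51), (50, 150, 280, 350)],
]
flags = flag_repeated_boxes(pages)
```

A position-only rule like this would miss a watermark that moves between pages and could wrongly flag a chart template reused at a fixed position, which is why the sketch reports flags rather than deleting anything.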

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spatial-plus-semantic pipeline could be adapted to extract elements from scanned image collections or HTML documents without major redesign.
Adding domain-specific fine-tuning to the semantic similarity step might raise caption association rates on specialized document types.
Explicit evaluation on multilingual PDFs and artistically designed layouts would clarify the operating envelope of the heuristic-semantic balance.

Load-bearing premise

The tested benchmarks and internal product data sufficiently represent the full variety of real-world PDFs, including complex layouts, non-English text, and heavily formatted files.

What would settle it

Accuracy measurements on a held-out collection of PDFs with irregular overlapping elements, non-Latin scripts, or non-standard dense formatting that fall below 90 percent detection or 85 percent caption association would falsify the claim of reliable broad performance.
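
A minimal sketch of how such a held-out check could be scored follows. The paper's exact definition of "accuracy" is not given in the abstract, so the IoU-based greedy matching, the 0.5 IoU threshold, and all function names here are assumptions; only the 90 percent and 85 percent floors come from the text above.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_rate(truth, predicted, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched one-to-one by a prediction (greedy)."""
    unused = list(predicted)
    hits = 0
    for t in truth:
        best = max(unused, key=lambda p: iou(t, p), default=None)
        if best is not None and iou(t, best) >= iou_threshold:
            hits += 1
            unused.remove(best)  # each prediction may match only one truth box
    return hits / len(truth) if truth else 1.0


def caption_rate(truth_pairs, predicted_pairs):
    """Fraction of elements whose predicted caption id equals the labelled one."""
    if not truth_pairs:
        return 1.0
    correct = sum(predicted_pairs.get(e) == c for e, c in truth_pairs.items())
    return correct / len(truth_pairs)


def claim_survives(detection, caption, min_detection=0.90, min_caption=0.85):
    """Apply the two floors named above to a held-out measurement."""
    return detection >= min_detection and caption >= min_caption


# Synthetic example, not data from the paper: one good box, one badly shifted box.
truth = [(0, 0, 100, 100), (200, 200, 300, 300)]
predicted = [(5, 5, 100, 100), (260, 260, 360, 360)]
det = detection_rate(truth, predicted)
cap = caption_rate({"fig": "c1", "tab": "c2"}, {"fig": "c1", "tab": "c1"})
print(det, cap, claim_survives(det, cap))
```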

Original abstract

PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-level PDF parsing framework that accurately detects visual elements and associates captions with them using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves $\geq 96\%$ visual element detection accuracy and $93\%$ caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over $2\times$. We have deployed the proposed system in a challenging production environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A practical production PDF parser with strong benchmark numbers but sparse evaluation details.

The main takeaway is that this is a production-oriented PDF parser that uses a mix of spatial heuristics, layout analysis, and semantic similarity to extract figures, tables, and forms and to link them to their captions. It reports strong results on standard benchmarks and internal data, with deployment already in place: at least 96% accuracy on visual element detection and 93% on caption association. When plugged into multimodal RAG, it outperforms existing parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while cutting latency by more than half. The deployment in a challenging production setting adds some credibility to the practical claims.

What works here is the focus on real-world use. Many papers stop at benchmarks, but this one emphasizes low latency and integration into retrieval pipelines, which matters for applications.

The soft spots come down to missing context around the results. The abstract does not include dataset details, splits, statistical tests, or error breakdowns. There are no ablations to show how much each part of the combination contributes. The stress-test concern is valid: we do not see evidence that the approach handles complex layouts, non-English documents, scanned files, or other out-of-distribution cases beyond what was tested. The diversity of the internal product data is not described either.

This paper is for practitioners who need better document preprocessing for AI systems. Readers working on RAG or document AI will find the performance numbers and latency gains relevant. It is not pushing new theory or methods but refining an existing toolkit for reliability. It deserves a serious referee because the claims are specific and falsifiable, and the production angle is valuable even if the techniques are incremental. A review could clarify the evaluation and test the generalization. I would recommend sending it to peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a lightweight, production-ready PDF parsing framework that detects visual elements (figures, tables, forms) and associates captions via a combination of spatial heuristics, layout analysis, and semantic similarity. It reports ≥96% visual element detection accuracy and 93% caption association accuracy on popular benchmarks and internal product data, claims significant outperformance over state-of-the-art parsers and large vision-language models when used as preprocessing for multimodal RAG (on internal data and the MMDocRAG benchmark), a >2× latency reduction, and successful deployment in a challenging production environment.

Significance. If the performance claims are substantiated, the work would offer practical value for document understanding pipelines and multimodal RAG systems by providing an efficient, deployable alternative to heavier models. The combination of heuristics for a lightweight solution and the reported production deployment are notable strengths that could influence real-world applications, provided the results generalize.

major comments (2)

[Abstract] The central performance claims (≥96% detection accuracy, 93% caption association, outperformance on MMDocRAG, and >2× latency reduction) are presented without any description of the evaluation methodology, dataset splits, error analysis, statistical significance testing, or ablation studies. This absence makes it impossible to assess whether the numbers reflect robust, reproducible results or potential selection effects.

[Abstract] The generalization assumption, that spatial heuristics + layout analysis + semantic similarity maintain high accuracy across diverse real-world PDFs, is load-bearing for the production-readiness claim, yet no evidence is provided for out-of-distribution cases such as non-English documents, multi-page spanning tables, dense vector graphics, scanned documents, or heavily formatted files beyond the tested benchmarks and internal data.
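
One of the checks requested in the first comment, an uncertainty estimate on a reported accuracy, can be sketched with a percentile bootstrap. The outcomes below are synthetic (93 successes out of 100, chosen only to echo the headline figure); the test-set size, the function name `bootstrap_ci`, and the resampling settings are assumptions, and nothing here reproduces the paper's evaluation.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Synthetic per-element results: 1 = caption correctly associated, 0 = not.
outcomes = [1] * 93 + [0] * 7
point, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy {point:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

With only a hundred items the interval spans several percentage points, which is the practical reason a bare "93%" is hard to interpret without the evaluation set size.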

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concerns about the abstract's lack of methodological detail and generalization evidence below. We will revise the abstract and add supporting discussion to improve clarity and transparency without altering the core claims.

Referee: [Abstract] The central performance claims (≥96% detection accuracy, 93% caption association, outperformance on MMDocRAG, and >2× latency reduction) are presented without any description of the evaluation methodology, dataset splits, error analysis, statistical significance testing, or ablation studies. This absence makes it impossible to assess whether the numbers reflect robust, reproducible results or potential selection effects.

Authors: We agree that the abstract's brevity omits these details. The full manuscript contains dedicated sections on evaluation methodology, including dataset descriptions with splits from benchmarks such as PubLayNet and DocBank plus internal collections, error analysis, ablation studies on heuristic components, and statistical significance via repeated runs with confidence intervals in the results tables. We will revise the abstract to briefly reference the evaluation setup and key datasets, and to note that full details and ablations appear in the body of the paper. Revision: yes.

Referee: [Abstract] The generalization assumption, that spatial heuristics + layout analysis + semantic similarity maintain high accuracy across diverse real-world PDFs, is load-bearing for the production-readiness claim, yet no evidence is provided for out-of-distribution cases such as non-English documents, multi-page spanning tables, dense vector graphics, scanned documents, or heavily formatted files beyond the tested benchmarks and internal data.

Authors: Our internal product data, drawn from a challenging production environment, encompasses diverse real-world PDFs including non-English documents, multi-page tables, scanned content, and varied formatting. The reported deployment success provides direct evidence of robustness beyond standard benchmarks. We acknowledge that explicit, isolated tests on edge cases like dense vector graphics would further strengthen the generalization claim. We will add a limitations and generalization discussion section summarizing available internal results on these cases. Revision: partial.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical systems description of a heuristic-based PDF parsing framework (spatial heuristics + layout analysis + semantic similarity) whose performance is measured directly against external benchmarks and internal data. No equations, derivations, first-principles claims, or self-referential definitions appear in the provided text. All reported accuracies and latency improvements are framed as observed outcomes rather than reductions to fitted parameters or self-citation chains. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on domain assumptions about PDF structure rather than new mathematical entities or fitted constants.

axioms (1)

Domain assumption: PDF documents contain extractable spatial layout and textual information that spatial heuristics and semantic similarity can reliably exploit for element detection and caption association.
Invoked throughout the description of the detection and association pipeline.

pith-pipeline@v0.9.0 · 5480 in / 1353 out tokens · 30759 ms · 2026-05-08T08:28:18.520508+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1] Adobe Systems Incorporated. PDF 32000-1:2008 – document management – portable document format – part 1: PDF 1.7,

[2] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.

[3] Apache Software Foundation. Apache tika. https://tika.apache.org/, 2025.

[4] Artifex Software Inc. PyMuPDF Documentation, 2023. Accessed: 2025-07.

[5] Artifex Software Inc. PyMuPDF: Python bindings for MuPDF. https://pymupdf.readthedocs.io/,

[6] Yihao Ding, Siwen Luo, Hyunsuk Chung, and Soyeon Caren Han. Pdf-vqa: A new dataset for real-world vqa on pdf documents. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023.

[8] Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, and Soyeon Caren Han. Pdf-mvqa: A dataset for multimodal information retrieval in pdf-based visual question answering. arXiv:2404.12720, 2024.

[9] Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, and Yong Liu. Mmdocir: Benchmarking multi-modal retrieval for long documents, 2025.

[10] Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, and Yong Liu. Benchmarking retrieval-augmented multimodal generation for document question answering, 2025.

[11] Apache Software Foundation. Apache tika: A content analysis toolkit. https://tika.apache.org/, 2024. Accessed: 2024-06-24.

[12] Google AI. Document understanding – gemini api – google ai for developers, 2025. Accessed: 2025-07-09.

[13] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.

[14] Yiheng Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pages 949–960, 2020.

[15] Kefei Liu, João Paulo C. L. da Costa, Hing Cheung So, and André L. F. de Almeida. Semi-blind receivers for joint symbol and channel estimation in space-time-frequency mimo-ofdm systems. IEEE Transactions on Signal Processing,

[16] Kefei Liu, João Paulo C. L. da Costa, Hing Cheung So, Lei Huang, and Jieping Ye. Detection of number of components in candecomp/parafac models via minimum description length. Digital Signal Processing, 2016.

[17] M. Mathew, D. Karatzas, and C.V. Jawahar. Docvqa: A dataset for vqa on document images. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.

[18] Chris A. Mattmann. tika-python: Python bindings to apache tika. https://github.com/chrismattmann/tika-python, 2024. Version 1.24.

[19] Chris A. Mattmann and Jukka Zitting. Tika in Action. Manning Publications Co., 2012.

[20] Alec Radford, Jong Wook Kim, Luke Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pam Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.

[21] Hossein Rajabzadeh, Suyuchen Wang, Hyock Ju Kwon, and Bang Liu. Multimodal multi-hop question answering through a conversation between tools and efficiently fine-tuned large language models, 2023.

[22] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3982–3992. Association for Computational Linguistics, 2019.

[23] Yusuke Shinyama. Pdfminer: Python pdf parser and analyzer. https://github.com/euske/pdfminer, Accessed: 2025-07-15.

[25] Jeremy Singer-Vine. pdfplumber: Python library for extracting information from PDFs. https://github.com/jsvine/pdfplumber, 2024. Version 0.10.3.

[26] Unstructured Technologies. unstructured: A library for preprocessing and parsing unstructured data. https://github.com/Unstructured-IO/unstructured,

[27] Unstructured Technologies and contributors. unstructured: An open-source toolkit for document parsing. https://github.com/Unstructured-IO/unstructured, Accessed: 2024-06-24.

[29] Jing Wang, Weiqing Min, Sujuan Hou, Shengnan Ma, Yuanjie Zheng, Haishuai Wang, and Shuqiang Jiang. Logo-2K+: a large-scale logo dataset for scalable logo classification. In AAAI Conference on Artificial Intelligence. Accepted, 2020.

[30] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models, 2024.

[31] Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception, 2024.

[32] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest Dataset Ever for Document Layout Analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022, 2019.
