DocAtlas: Multilingual Document Understanding Across 80+ Languages

Abdullah Sohail; Ahmed Heakl; Ahmed Nassar; Fahad Shahbaz Khan; Imran Razzak; Peter W. J. Staar; Rania Elbadry; Salman Khan; Youssef Mohamed

arXiv: 2605.12623 · v2 · pith:MRWBTX6Anew · submitted 2026-05-12 · 💻 cs.CL · cs.CV · cs.LG

Pith reviewed 2026-05-22 09:47 UTC · model grok-4.3

classification 💻 cs.CL · cs.CV · cs.LG

keywords: multilingual document understanding · OCR datasets · Direct Preference Optimization · document layout analysis · low-resource languages · multilingual adaptation · structural annotation

The pith

Direct Preference Optimization using rendering-derived annotations adapts document models to 82 languages with accuracy gains and no base-language degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome scarce training data and biased model annotations that limit document understanding in low-resource languages. It builds DocAtlas as a framework for high-fidelity datasets and benchmarks spanning 82 languages through model-free pipelines that render native documents and generate synthetic ones to produce precise structural labels. The central demonstration is that Direct Preference Optimization trained on these rendering-based labels as positive signals delivers stable multilingual adaptation, raising both in-domain and out-of-domain performance while leaving base-language accuracy intact, in contrast to supervised fine-tuning that can sharply reduce out-of-domain results.

Core claim

DocAtlas constructs OCR datasets and benchmarks across 82 languages and 9 tasks via dual pipelines of differential rendering from native DOCX files and synthetic LaTeX generation for right-to-left scripts, yielding unified DocTag annotations for layout, text, and component types without learned models. Direct Preference Optimization that treats these rendering-derived labels as the positive signal achieves stable adaptation, producing +1.9% in-domain and +1.8% out-of-domain accuracy improvements with no measurable degradation on base languages, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. The best resulting model improves 1.7% over the strongest baseline.

What carries the argument

Dual rendering pipelines that generate precise structural annotations in DocTag format from native and synthetic documents, used as reliable positive signals in Direct Preference Optimization for multilingual adaptation.
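
A minimal sketch of the standard DPO objective for a single preference pair, assuming that the rendering-derived annotation plays the role of the preferred output. How the paper constructs the rejected output is not stated on this page; the sketch simply assumes some lower-quality model transcription. The function names, the `beta` value, and the example log-probabilities are illustrative and not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # illustrative value, not from the paper
) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities.

    chosen   = rendering-derived annotation (assumed positive signal)
    rejected = some lower-quality output (assumption for this sketch)
    """
    # How much more the policy favors each output than the frozen reference does.
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    margin = chosen_shift - rejected_shift
    return -log_sigmoid(beta * margin)


# Hypothetical log-probabilities: the policy has moved toward the chosen output.
loss = dpo_loss(policy_chosen=-5.0, policy_rejected=-12.0,
                ref_chosen=-10.0, ref_rejected=-10.0)
print(f"DPO loss: {loss:.4f}")
```

Because the loss depends only on how the policy moves relative to a frozen reference model, the update is anchored to the base model's behavior. That anchoring is one plausible reading of why the reported adaptation leaves base-language accuracy intact, though the page does not establish the mechanism.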

If this is right

Multilingual adaptation improves accuracy both inside and outside the training distribution.
Base-language performance remains unchanged after adaptation.
The method avoids the large out-of-domain drops produced by supervised fine-tuning.
The resulting models outperform prior state-of-the-art baselines on the new benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rendering-plus-DPO pipeline could be tested on additional document formats such as PDF or HTML to broaden coverage.
The approach may support iterative improvement by feeding model outputs back into the rendering loop for self-refinement.
Similar preference signals derived from rendering could be explored for other layout-sensitive tasks like table extraction or form understanding.

Load-bearing premise

Differential rendering of native documents and synthetic LaTeX generation produce precise structural annotations that serve as reliable ground truth for DPO without introducing systematic errors in layout or component typing.
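
One way to picture this premise, as a sketch under stated assumptions: the Figure 2 caption says the native pipeline colorizes Word files before annotating them, so a bounding box can in principle be recovered by rendering a page twice (once unchanged, once with a single element recolored) and locating the pixels that differ. The code below illustrates that idea on toy pixel grids. The names `diff_bbox`, `Pixel`, and `Image` are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]   # RGB triple
Image = List[List[Pixel]]      # rows of pixels, image[y][x]


def diff_bbox(base: Image, colorized: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1), inclusive, of all pixels that differ, or None."""
    xs: List[int] = []
    ys: List[int] = []
    for y, (row_a, row_b) in enumerate(zip(base, colorized)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # the recolored element did not render on this page
    return min(xs), min(ys), max(xs), max(ys)


# Toy example: a 5x4 white page where one "element" is recolored red.
WHITE, RED = (255, 255, 255), (255, 0, 0)
page = [[WHITE for _ in range(5)] for _ in range(4)]
marked = [row[:] for row in page]
for y in (1, 2):
    for x in (2, 3):
        marked[y][x] = RED

print(diff_bbox(page, marked))
```

Any systematic offset in such a procedure (anti-aliasing, overlapping elements, font fallback in low-resource scripts) would propagate directly into the labels, which is why this premise is load-bearing.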

What would settle it

A test set of additional low-resource languages where the DPO-adapted model shows out-of-domain degradation comparable to supervised fine-tuning would falsify the stability claim.

Figures

Figures reproduced from arXiv: 2605.12623 by Abdullah Sohail, Ahmed Heakl, Ahmed Nassar, Fahad Shahbaz Khan, Imran Razzak, Peter W. J. Staar, Rania Elbadry, Salman Khan, Youssef Mohamed.

**Figure 1.** Overview of the DocAtlas framework. (Left) Global script coverage across 80+ languages spanning 10 writing systems, illustrating the geographical and typological diversity of the corpus. (Right) Cross-lingual transfer performance after DPO training, showing consistent gains in both in-domain and out-of-domain accuracy across major OCR and vision-language models.

**Figure 2.** End-to-end data pipelines. We implement two pipelines: a high-fidelity pipeline for native DOCX documents and a synthetic RTL pipeline for underrepresented scripts. The native pipeline extracts, filters, colorizes, and annotates Word files, while the RTL pipeline converts structured inputs (EPUB, HTML, XML) into precisely annotated PDF documents using LaTeX synthesis.

**Figure 4.** Overview of the DocAtlas synthetic data generation pipeline. Structured inputs (HTML, XML, DOCX, EPUB) are parsed into DocTag snippets and rendered via LaTeX templates with positional logging. Through multiple compilations, the system produces aligned PDF documents and precise element-level annotations (DocTag, Markdown, and visual overlays).

**Figure 5.** Accuracy distribution across high- and low-resource languages.

**Figure 6.** OCR accuracy across language families. Scores (brighter is better) show average performance across 14 models and 7 families. Top models (e.g., DeepseekOCR, Chandra) are consistent, while others degrade on low-resource scripts.

**Figure 7.** Chart extraction accuracy across 15 languages. Gemini-2.5-Flash achieves the highest average.

**Figure 8.** DPO gains across language families.

**Figure 9.** Language frequency distribution in the DocAtlas corpus. The dataset exhibits a long-tailed distribution across 80+ languages, with high-resource scripts (e.g., en, ru, es) dominating the head and low-resource languages (e.g., ps, ckb, ku, azb) forming a diverse tail.

**Figure 10.** Tag frequency distribution in DocAtlas.

**Figure 12.** Perplexity-based filtering across five languages.

**Figure 14.** Scores vs. model scale. Each point represents a model; marker size encodes parameter count. Larger models do not uniformly dominate: several compact expert systems (≤3B) match or exceed general-purpose VLMs on both text and table scores.

**Figure 13.** Qualitative comparison of layout parser failure cases. Common error types (extra, overlapping, missing detections; wrong categories) are shown for Layout Parser (left) and our DocAtlas system (right). DocAtlas consistently produces more accurate and cleaner segmentation.

**Figure 15.** Document type performance.

Original abstract

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DocAtlas builds broad multilingual document data via model-free rendering and uses it for DPO adaptation that holds up better than SFT, but the results rest on unvalidated annotations.

DocAtlas builds a dataset for 82 languages by rendering native DOCX files differentially and generating synthetic LaTeX documents for right-to-left scripts, then feeds the resulting DocTag annotations into DPO as the preferred output. The main result is that this yields small gains on both in-domain (+1.9%) and out-of-domain (+1.8%) tasks with no measurable loss on the base languages, while avoiding the out-of-domain drop of up to 21% that supervised fine-tuning produces. That combination of scale and stability is the practical takeaway worth noting first.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DocAtlas, a framework for constructing high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. It employs dual annotation pipelines—differential rendering of native DOCX documents and synthetic LaTeX generation for right-to-left scripts—to produce structural annotations in a unified DocTag format encoding layout, text, and component types without learned models. Evaluation of 16 state-of-the-art models reveals persistent gaps in low-resource scripts. The central empirical result is that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal yields stable multilingual adaptation, with +1.9% in-domain and +1.8% out-of-domain accuracy gains and no measurable base-language degradation, while supervised fine-tuning degrades out-of-domain performance by up to 21%. The best variant (DocAtlas-DeepSeek) improves +1.7% over the strongest baseline. Code is released at the provided GitHub link.

Significance. If the annotation pipeline produces unbiased ground truth, the work provides a scalable, model-free method for creating multilingual document datasets that could reduce annotation biases in low-resource languages. The contrast between DPO's stable adaptation and SFT's degradation is a potentially useful empirical finding for multilingual fine-tuning strategies. Releasing code supports reproducibility, though the absence of statistical details limits immediate impact assessment.

major comments (2)

[Abstract / Results] Abstract and results section: the headline DPO improvements (+1.9% in-domain, +1.8% OOD) and the claim of 'no measurable base-language degradation' are reported without error bars, statistical significance tests, details on data splits, or exhaustive baseline comparisons. This weakens the central adaptation claim, as the reported gains rest on limited visible evidence and could be sensitive to evaluation choices.
[Annotation Pipelines] Annotation pipeline description: the assertion that differential DOCX rendering and synthetic LaTeX 'produce precise structural annotations' in DocTag format without introducing systematic errors is load-bearing for the DPO positive-signal construction. No error-rate quantification, human validation of annotations, or analysis of failure modes for low-resource/RTL scripts is referenced, leaving open the possibility that DPO gains reflect annotation artifacts rather than genuine adaptation.

minor comments (2)

[Methods] Clarify the exact definition and schema of the DocTag format (e.g., how component types and layout elements are encoded) to aid reproducibility.
[Experiments] The evaluation of 16 models would benefit from an explicit table listing all baselines, their training regimes, and per-language or per-task breakdowns rather than aggregate percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate additional details and validations as suggested.

Referee: Abstract and results section: the headline DPO improvements (+1.9% in-domain, +1.8% OOD) and the claim of 'no measurable base-language degradation' are reported without error bars, statistical significance tests, details on data splits, or exhaustive baseline comparisons. This weakens the central adaptation claim, as the reported gains rest on limited visible evidence and could be sensitive to evaluation choices.

Authors: We agree with the referee that the central empirical claims would benefit from greater statistical transparency. In the revised manuscript, we will add error bars derived from repeated evaluations across different random seeds for the reported accuracy gains. We will also include results from statistical significance testing (such as McNemar's test or paired t-tests where applicable) to evaluate the improvements. Furthermore, we will provide detailed descriptions of the data splits used for in-domain and out-of-domain assessments and include additional baseline models and ablation studies to make the comparisons more exhaustive. These changes will address the concern regarding the robustness of the adaptation results. (Revision: yes)
Referee: Annotation pipeline description: the assertion that differential DOCX rendering and synthetic LaTeX 'produce precise structural annotations' in DocTag format without introducing systematic errors is load-bearing for the DPO positive-signal construction. No error-rate quantification, human validation of annotations, or analysis of failure modes for low-resource/RTL scripts is referenced, leaving open the possibility that DPO gains reflect annotation artifacts rather than genuine adaptation.

Authors: The annotation pipelines are constructed to be deterministic and free of learned components to minimize bias introduction. Differential rendering from DOCX files directly captures the structural elements, and the LaTeX-based approach for RTL scripts ensures accurate text and layout rendering through standard compilation. Nevertheless, we recognize the value of explicit validation. We will revise the manuscript to include an analysis of annotation quality, featuring error rates computed against human-annotated samples for a diverse set of languages including low-resource and RTL scripts. We will also provide a discussion of potential failure modes and how the pipelines handle them. The open-sourced code will facilitate community verification of these aspects. (Revision: yes)
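
The significance testing mentioned in the first response can be sketched as follows. This is an exact (binomial) McNemar test on paired per-sample correctness for a base model and its adapted variant; it is an illustration of the named test, not the authors' evaluation code. The function names and the toy correctness vectors are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    base_correct: Sequence[bool], adapted_correct: Sequence[bool]
) -> Tuple[int, int]:
    """Count samples where exactly one of the two models is correct.

    b = base correct, adapted wrong; c = base wrong, adapted correct.
    """
    b = sum(1 for x, y in zip(base_correct, adapted_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, adapted_correct) if not x and y)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy paired outcomes on the same four evaluation samples (hypothetical data).
base = [True, True, False, False]
adapted = [True, False, True, True]
b, c = discordant_counts(base, adapted)
print(b, c, mcnemar_exact(b, c))
```

A paired test of this kind uses only the samples on which the two models disagree, which suits gains as small as the reported +1.9% and +1.8%: whether they are distinguishable from noise depends on the number of discordant samples, not on the aggregate accuracies alone.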

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and independent rendering pipeline.

The manuscript introduces DocAtlas as a data-construction framework using differential DOCX rendering and synthetic LaTeX generation to produce DocTag annotations, then reports empirical accuracy numbers for 16 models plus DPO adaptation. No equations, fitted parameters, or first-principles derivations appear; the reported +1.9 % / +1.8 % gains and the contrast with SFT are direct experimental comparisons against external baselines rather than quantities defined in terms of the same data or self-citations. The ground-truth pipeline is presented as an external, non-learned process, so the central claims do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on the assumption that rendering pipelines yield unbiased ground truth and that DPO can leverage this signal without introducing new biases; no free parameters or invented physical entities are introduced.

axioms (2)

Domain assumption: Differential rendering of native DOCX documents produces precise structural annotations without learned models.
Invoked as the source of ground truth for both dataset creation and DPO positive signals.
Domain assumption: Synthetic LaTeX generation accurately captures right-to-left script layout and component types.
Required for the second pipeline to cover RTL languages reliably.

invented entities (1)

DocTag format (no independent evidence)
Purpose: unified encoding of layout, text, and component types.
New annotation schema introduced to standardize output across pipelines.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 13 internal anchors

[1]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model , author=. arXiv preprint arXiv:2409.01704 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

olmocr: Unlocking trillions of tokens in pdfs with vi- sion language models.arXiv preprint arXiv:2502.18443, 2025a

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models , author=. arXiv preprint arXiv:2502.18443 , year=

work page arXiv
[3]

arXiv preprint arXiv:2506.05218 , year=

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm , author=. arXiv preprint arXiv:2506.05218 , year=

work page arXiv
[4]

2025 , note=

Mistral OCR , author=. 2025 , note=

work page 2025
[5]

DeepSeek-OCR: Contexts Optical Compression

Deepseek-ocr: Contexts optical compression , author=. arXiv preprint arXiv:2510.18234 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

International Conference on Learning Representations , volume=

Nougat: Neural optical understanding for academic documents , author=. International Conference on Learning Representations , volume=

work page
[7]

arXiv preprint arXiv:2408.09869 , year=

Docling technical report , author=. arXiv preprint arXiv:2408.09869 , year=

work page arXiv
[8]

Granite Docling: A 258M-Parameter Multimodal VLM for Document Understanding , author=

work page
[9]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[10]

Marker: Fast and Accurate PDF to Markdown Converter , author=

work page
[11]

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU: An Open-Source Solution for Precise Document Content Extraction , author=. arXiv preprint arXiv:2409.18839 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

5: A decoupled vision-language model for efficient high-resolution document parsing , author=

Mineru2. 5: A decoupled vision-language model for efficient high-resolution document parsing , author=. The 64th Annual Meeting of the Association for Computational Linguistics--Industry Track , year=

work page
[13]

dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model , author=

work page
[14]

Proceedings of the 65th Annual Meeting of the Association for Computational Linguistics (ACL) , year=

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting , author=. Proceedings of the 65th Annual Meeting of the Association for Computational Linguistics (ACL) , year=

work page
[15]

Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging , author=

work page
[16]

Mathpix OCR API , author=

work page
[17]

Pix2Text: An Open-Source Tool for Recognizing Layouts, Tables, Math Formulas, and Text in Images , author=

work page
[18]

OCRFlux: Mastering Complex Layouts and Seamless Page Merging , author=

work page
[19]

Unstructured: Open-Source Pre-Processing Tools for Unstructured Data , author=

work page
[20]

OpenParse: Visually-Driven Document Parser for LLM Ingestion , author=

work page
[21]

2025 , note=

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model , author=. 2025 , note=

work page 2025
[22]

PaddleOCR 3.0 Technical Report

PaddleOCR 3.0 Technical Report , author=. arXiv preprint arXiv:2507.05595 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Ocean-ocr: Towards general ocr application via a vision-language model.arXiv preprint arXiv:2501.15558,

Ocean-OCR: Towards General OCR Application via a Vision-Language Model , author=. arXiv preprint arXiv:2501.15558 , year=

work page arXiv
[24]

arXiv preprint arXiv:2509.01215 , year=

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion , author=. arXiv preprint arXiv:2509.01215 , year=

work page arXiv
[25]

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year=

DocVQA: A Dataset for VQA on Document Images , author=. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year=

work page
[26]

2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) , volume=

FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents , author=. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) , volume=. 2019 , organization=. doi:10.1109/ICDARW.2019.10029 , note=

work page doi:10.1109/icdarw.2019.10029 2019
[27]

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume=

ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT , author=. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume=. 2017 , organization=

work page 2017
[28]

Findings of the Association for Computational Linguistics: ACL 2022 , pages=

XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding , author=. Findings of the Association for Computational Linguistics: ACL 2022 , pages=. 2022 , doi=

work page 2022
[29]

2019 International Conference on Document Analysis and Recognition (ICDAR) , pages=

PubLayNet: Largest Dataset Ever for Document Layout Analysis , author=. 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages=. 2019 , organization=. doi:10.1109/ICDAR.2019.00166 , note=

work page doi:10.1109/icdar.2019.00166 2019
[30]

arXiv preprint arXiv:2206.01062 , year=

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis , author=. arXiv preprint arXiv:2206.01062 , year=

work page arXiv
[31]

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models , author=. arXiv preprint arXiv:2305.07895 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019

ICDAR 2019 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Recognition - RRC-MLT-2019 , author=. arXiv preprint arXiv:1907.00945 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2019
[33]

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages=

LayoutLM: Pre-training of Text and Layout for Document Image Understanding , author=. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages=. 2020 , doi=

work page 2020
[34]

arXiv preprint arXiv:2012.14740 , year=

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding , author=. arXiv preprint arXiv:2012.14740 , year=

work page arXiv 2012
[35]

Document Analysis and Recognition--ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5--10, 2021, Proceedings, Part I 16 , pages=

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis , author=. Document Analysis and Recognition--ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5--10, 2021, Proceedings, Part I 16 , pages=. 2021 , organization=. doi:10.1007/978-3-030-86549-8_9 , note=

work page doi:10.1007/978-3-030-86549-8_9 2021
[36]

arXiv preprint arXiv:2103.15992 , year=

A Multiplexed Network for End-to-End, Multilingual OCR , author=. arXiv preprint arXiv:2103.15992 , year=

work page arXiv
[37]

arXiv preprint arXiv:2410.16153 , year=

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages , author=. arXiv preprint arXiv:2410.16153 , year=

work page arXiv
[38]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2023 , note=

work page 2023
[39]

arXiv preprint arXiv:2104.08836 , year=

LayoutXLM: Multimodal Pre-training for Multilingual Visually-Rich Document Understanding , author=. arXiv preprint arXiv:2104.08836 , year=

work page arXiv
[40]

arXiv preprint arXiv:2412.17787 , year=

Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective , author=. arXiv preprint arXiv:2412.17787 , year=

work page arXiv
[41]

Document Analysis and Recognition--ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5--10, 2021, Proceedings, Part III 16 , pages=

SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models , author=. Document Analysis and Recognition--ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5--10, 2021, Proceedings, Part III 16 , pages=. 2021 , organization=. doi:10.1007/978-3-030-86337-1_8 , note=

work page doi:10.1007/978-3-030-86337-1_8 2021
[42]

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

A Synthetic Recipe for OCR , author=. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

work page 2020
[43]

arXiv preprint arXiv:2103.08236 , year=

Generating Synthetic Handwritten Historical Documents With OCR Constrained GANs , author=. arXiv preprint arXiv:2103.08236 , year=

work page arXiv
[44]

Journal of Imaging , volume=

DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images , author=. Journal of Imaging , volume=. 2017 , publisher=

work page 2017
[45]

TextRecognitionDataGenerator , author=

work page
[46]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[47]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[49]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

AAAI Conference on Artificial Intelligence , year=

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion , author=. AAAI Conference on Artificial Intelligence , year=

work page
[51]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging , author=

work page
[54]

2025 , version =

Datalab To , title =. 2025 , version =

work page 2025
[55]

European conference on computer vision , pages=

Image-based table recognition: data, model, and evaluation , author=. European conference on computer vision , pages=. 2020 , organization=

work page 2020
[56]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Readoc: A unified benchmark for realistic document structured extraction , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[57]

Docbank: A bench- mark dataset for document layout analysis

Docbank: A benchmark dataset for document layout analysis , author=. arXiv preprint arXiv:2006.01038 , year=

work page arXiv 2006
[58]

arXiv preprint arXiv:2210.05391 , year=

Pp-structurev2: A stronger document analysis system , author=. arXiv preprint arXiv:2210.05391 , year=

work page arXiv
[59]

European Conference on Computer Vision , pages=

Ocr-free document understanding transformer , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
[60]

International Conference on Machine Learning , pages=

Pix2struct: Screenshot parsing as pretraining for visual language understanding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[61] Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
[62] LoRA: Low-rank adaptation of large language models. ICLR.
[63] An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing.
[64] Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu. arXiv preprint arXiv:2507.06761.
[65] Task grouping for multilingual text recognition. European Conference on Computer Vision, 2022.
[66] KITAB-Bench: A comprehensive multi-domain benchmark for Arabic OCR and document understanding. Findings of the Association for Computational Linguistics: ACL 2025.
[67] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.
[68] How to Teach Large Multimodal Models New Skills. arXiv preprint arXiv:2510.08564.
[69] Bradski, G. Dr. Dobb's Journal of Software Tools.
[70] 2024.
[71] FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
[72] LayoutLMv3: Pre-training for Document AI with unified text and image masking. Proceedings of the 30th ACM International Conference on Multimedia.
[73] CCNet: Extracting high quality monolingual datasets from web crawl data. Proceedings of the Twelfth Language Resources and Evaluation Conference.
[74] Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[75] The Faiss library. IEEE Transactions on Big Data.
[76] Common Crawl: Data collection and use cases for NLP. HPLT & NLPL Winter School on Large-Scale Language Modeling and Neural Machine Translation with Web Data, February.
[77] WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data. Advances in Neural Information Processing Systems.
[78] From chaotic OCR words to coherent document: A fine-to-coarse zoom-out network for complex-layout document image translation. Proceedings of the 31st International Conference on Computational Linguistics.
