arxiv: 2605.12623 · v2 · pith:MRWBTX6Anew · submitted 2026-05-12 · 💻 cs.CL · cs.CV · cs.LG

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Pith reviewed 2026-05-22 09:47 UTC · model grok-4.3

classification 💻 cs.CL cs.CV cs.LG
keywords multilingual document understanding, OCR datasets, Direct Preference Optimization, document layout analysis, low-resource languages, multilingual adaptation, structural annotation

The pith

Direct Preference Optimization using rendering-derived annotations adapts document models to 82 languages with accuracy gains and no base-language degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome scarce training data and biased model annotations that limit document understanding in low-resource languages. It builds DocAtlas as a framework for high-fidelity datasets and benchmarks spanning 82 languages through model-free pipelines that render native documents and generate synthetic ones to produce precise structural labels. The central demonstration is that Direct Preference Optimization trained on these rendering-based labels as positive signals delivers stable multilingual adaptation, raising both in-domain and out-of-domain performance while leaving base-language accuracy intact, in contrast to supervised fine-tuning that can sharply reduce out-of-domain results.

Core claim

DocAtlas constructs OCR datasets and benchmarks across 82 languages and 9 tasks via dual pipelines of differential rendering from native DOCX files and synthetic LaTeX generation for right-to-left scripts, yielding unified DocTag annotations for layout, text, and component types without learned models. Direct Preference Optimization that treats these rendering-derived labels as the positive signal achieves stable adaptation, producing +1.9% in-domain and +1.8% out-of-domain accuracy improvements with no measurable degradation on base languages, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. The best resulting model improves 1.7% over the strongest baseline.
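The preference objective behind this claim can be sketched as follows. This is a minimal illustration of the standard per-example DPO loss, not the authors' training code: the `PreferencePair` container, the choice of a flawed model prediction as the rejected sample, and `beta=0.1` are all assumptions made for illustration. Only the use of the rendering-derived annotation as the chosen (positive) sample comes from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO example (hypothetical container, not the paper's schema)."""
    chosen: str    # rendering-derived DocTag annotation (positive signal)
    rejected: str  # assumption: a flawed model prediction for the same page


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    beta is an illustrative value; the paper's setting is not stated here.
    """
    # Implicit rewards: how far the policy has moved from the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-beta * (chosen_margin - rejected_margin))


# A policy identical to the reference sits at log(2); raising the chosen
# annotation's likelihood relative to the rejected one lowers the loss.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-5.0, -20.0, -10.0, -10.0)
```

Because the loss depends only on how far the policy moves relative to a frozen reference model, it penalizes drift less bluntly than plain likelihood training. That is one commonly cited intuition for why DPO might preserve base-language behavior better than supervised fine-tuning, though the paper's own explanation should be taken from the source.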

What carries the argument

Dual rendering pipelines that generate precise structural annotations in DocTag format from native and synthetic documents, used as reliable positive signals in Direct Preference Optimization for multilingual adaptation.

If this is right

  • Multilingual adaptation improves accuracy both inside and outside the training distribution.
  • Base-language performance remains unchanged after adaptation.
  • The method avoids the large out-of-domain drops produced by supervised fine-tuning.
  • The resulting models outperform prior state-of-the-art baselines on the new benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rendering-plus-DPO pipeline could be tested on additional document formats such as PDF or HTML to broaden coverage.
  • The approach may support iterative improvement by feeding model outputs back into the rendering loop for self-refinement.
  • Similar preference signals derived from rendering could be explored for other layout-sensitive tasks like table extraction or form understanding.

Load-bearing premise

Differential rendering of native documents and synthetic LaTeX generation produce precise structural annotations that serve as reliable ground truth for DPO without introducing systematic errors in layout or component typing.
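To make the premise concrete, here is one plausible reading of how a model-free pipeline could recover layout boxes. The Figure 2 caption says the native pipeline "colorizes" Word files; the mechanism below is an assumption, not the paper's documented procedure. It supposes each structural element is rendered in a unique flat color, so its bounding box can be read off the pixels. The function name and the nested-list raster format are hypothetical.

```python
def boxes_from_color_render(pixels, background=(255, 255, 255)):
    """Map each non-background color to its inclusive box (x0, y0, x1, y1).

    Assumes (hypothetically) one unique flat color per document element.
    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    """
    boxes = {}
    for y, row in enumerate(pixels):
        for x, color in enumerate(row):
            if color == background:
                continue
            if color in boxes:
                x0, y0, x1, y1 = boxes[color]
                boxes[color] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
            else:
                boxes[color] = (x, y, x, y)
    return boxes


W, R, B = (255, 255, 255), (255, 0, 0), (0, 0, 255)
page = [
    [W, W, W, W, W],
    [W, R, R, W, W],
    [W, R, R, W, W],
    [W, W, W, B, B],
]
boxes = boxes_from_color_render(page)
# boxes[R] == (1, 1, 2, 2) and boxes[B] == (3, 3, 4, 3)
```

Even in this toy form the sketch shows where systematic errors could enter: anti-aliasing, overlapping elements, or glyphs that extend past their nominal box would each shift the recovered coordinates.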

What would settle it

A test set of additional low-resource languages where the DPO-adapted model shows out-of-domain degradation comparable to supervised fine-tuning would falsify the stability claim.

Figures

Figures reproduced from arXiv: 2605.12623 by Abdullah Sohail, Ahmed Heakl, Ahmed Nassar, Fahad Shahbaz Khan, Imran Razzak, Peter W. J. Staar, Rania Elbadry, Salman Khan, Youssef Mohamed.

Figure 1. Overview of the DocAtlas framework. (Left) Global script coverage across 80+ languages spanning 10 writing systems, illustrating the geographical and typological diversity of the corpus. (Right) Cross-lingual transfer performance after DPO training, showing consistent gains in both in-domain and out-of-domain accuracy across major OCR and vision-language models.
Figure 2. End-to-end data pipelines. We implement two pipelines: a high-fidelity pipeline for native DOCX documents and a synthetic RTL pipeline for underrepresented scripts. The native pipeline extracts, filters, colorizes, and annotates Word files, while the RTL pipeline converts structured inputs (EPUB, HTML, XML) into precisely annotated PDF documents using LaTeX synthesis.
Figure 4. Overview of the DocAtlas synthetic data generation pipeline. Structured inputs (HTML, XML, DOCX, EPUB) are parsed into DocTag snippets and rendered via LaTeX templates with positional logging. Through multiple compilations, the system produces aligned PDF documents and precise element-level annotations (DocTag, Markdown, and visual overlays).
Figure 5. Accuracy distribution across high- and low-resource languages.
Figure 6. OCR accuracy across language families. Scores (brighter is better) show average performance across 14 models and 7 families. Top models (e.g., DeepseekOCR, Chandra) are consistent, while others degrade on low-resource scripts.
Figure 7. Chart extraction accuracy across 15 languages. Gemini-2.5-Flash achieves the highest average.
Figure 8. DPO gains across language families.
Figure 9. Language frequency distribution in the DocAtlas corpus. The dataset exhibits a long-tailed distribution across 80+ languages, with high-resource scripts (e.g., en, ru, es) dominating the head and low-resource languages (e.g., ps, ckb, ku, azb) forming a diverse tail.
Figure 10. Tag frequency distribution in DocAtlas.
Figure 12. Perplexity-based filtering across five languages.
Figure 14. Scores vs. model scale. Each point represents a model; marker size encodes parameter count. Larger models do not uniformly dominate: several compact expert systems (≤3B) match or exceed general-purpose VLMs on both text and table scores.
Figure 13. Qualitative comparison of layout parser failure cases. Common error types (extra, overlapping, missing detections; wrong categories) are shown for Layout Parser (left) and our DocAtlas system (right). DocAtlas consistently produces more accurate and cleaner segmentation.
Figure 15. Document type performance.
Original abstract

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DocAtlas, a framework for constructing high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. It employs dual annotation pipelines—differential rendering of native DOCX documents and synthetic LaTeX generation for right-to-left scripts—to produce structural annotations in a unified DocTag format encoding layout, text, and component types without learned models. Evaluation of 16 state-of-the-art models reveals persistent gaps in low-resource scripts. The central empirical result is that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal yields stable multilingual adaptation, with +1.9% in-domain and +1.8% out-of-domain accuracy gains and no measurable base-language degradation, while supervised fine-tuning degrades out-of-domain performance by up to 21%. The best variant (DocAtlas-DeepSeek) improves +1.7% over the strongest baseline. Code is released at the provided GitHub link.

Significance. If the annotation pipeline produces unbiased ground truth, the work provides a scalable, model-free method for creating multilingual document datasets that could reduce annotation biases in low-resource languages. The contrast between DPO's stable adaptation and SFT's degradation is a potentially useful empirical finding for multilingual fine-tuning strategies. Releasing code supports reproducibility, though the absence of statistical details limits immediate impact assessment.

major comments (2)
  1. [Abstract / Results] Abstract and results section: the headline DPO improvements (+1.9% in-domain, +1.8% OOD) and the claim of 'no measurable base-language degradation' are reported without error bars, statistical significance tests, details on data splits, or exhaustive baseline comparisons. This weakens the central adaptation claim, as the reported gains rest on limited visible evidence and could be sensitive to evaluation choices.
  2. [Annotation Pipelines] Annotation pipeline description: the assertion that differential DOCX rendering and synthetic LaTeX 'produce precise structural annotations' in DocTag format without introducing systematic errors is load-bearing for the DPO positive-signal construction. No error-rate quantification, human validation of annotations, or analysis of failure modes for low-resource/RTL scripts is referenced, leaving open the possibility that DPO gains reflect annotation artifacts rather than genuine adaptation.
minor comments (2)
  1. [Methods] Clarify the exact definition and schema of the DocTag format (e.g., how component types and layout elements are encoded) to aid reproducibility.
  2. [Experiments] The evaluation of 16 models would benefit from an explicit table listing all baselines, their training regimes, and per-language or per-task breakdowns rather than aggregate percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate additional details and validations as suggested.

point-by-point responses
  1. Referee: Abstract and results section: the headline DPO improvements (+1.9% in-domain, +1.8% OOD) and the claim of 'no measurable base-language degradation' are reported without error bars, statistical significance tests, details on data splits, or exhaustive baseline comparisons. This weakens the central adaptation claim, as the reported gains rest on limited visible evidence and could be sensitive to evaluation choices.

    Authors: We agree with the referee that the central empirical claims would benefit from greater statistical transparency. In the revised manuscript, we will add error bars derived from repeated evaluations across different random seeds for the reported accuracy gains. We will also include results from statistical significance testing (such as McNemar's test or paired t-tests where applicable) to evaluate the improvements. Furthermore, we will provide detailed descriptions of the data splits used for in-domain and out-of-domain assessments and include additional baseline models and ablation studies to make the comparisons more exhaustive. These changes will address the concern regarding the robustness of the adaptation results.
    revision: yes

  2. Referee: Annotation pipeline description: the assertion that differential DOCX rendering and synthetic LaTeX 'produce precise structural annotations' in DocTag format without introducing systematic errors is load-bearing for the DPO positive-signal construction. No error-rate quantification, human validation of annotations, or analysis of failure modes for low-resource/RTL scripts is referenced, leaving open the possibility that DPO gains reflect annotation artifacts rather than genuine adaptation.

    Authors: The annotation pipelines are constructed to be deterministic and free of learned components to minimize bias introduction. Differential rendering from DOCX files directly captures the structural elements, and the LaTeX-based approach for RTL scripts ensures accurate text and layout rendering through standard compilation. Nevertheless, we recognize the value of explicit validation. We will revise the manuscript to include an analysis of annotation quality, featuring error rates computed against human-annotated samples for a diverse set of languages including low-resource and RTL scripts. We will also provide a discussion of potential failure modes and how the pipelines handle them. The open-sourced code will facilitate community verification of these aspects.
    revision: yes
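The first response names McNemar's test as one candidate significance check. A minimal sketch of the exact (binomial) form is given below, assuming per-sample correct/incorrect outcomes are available for the base and adapted models on the same test items. The function names and the toy outcome lists are illustrative and do not come from the paper.

```python
from math import comb


def discordant_counts(base_correct, adapted_correct):
    """Count items where exactly one of the two models is correct."""
    pairs = list(zip(base_correct, adapted_correct))
    b = sum(1 for x, y in pairs if x and not y)  # base right, adapted wrong
    c = sum(1 for x, y in pairs if y and not x)  # adapted right, base wrong
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant item is equally likely to
    favor either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy outcomes (not real results): the adapted model fixes two errors
# and introduces one.
base = [True, True, False, False]
adapted = [True, False, True, True]
b, c = discordant_counts(base, adapted)
p_value = mcnemar_exact(b, c)
```

The test uses only items where the two models disagree, which suits paired comparisons on a shared benchmark. With gains on the order of one to two percentage points, the number of discordant items largely determines whether the difference is distinguishable from noise.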

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and independent rendering pipeline.

full rationale

The manuscript introduces DocAtlas as a data-construction framework using differential DOCX rendering and synthetic LaTeX generation to produce DocTag annotations, then reports empirical accuracy numbers for 16 models plus DPO adaptation. No equations, fitted parameters, or first-principles derivations appear; the reported +1.9% / +1.8% gains and the contrast with SFT are direct experimental comparisons against external baselines rather than quantities defined in terms of the same data or self-citations. The ground-truth pipeline is presented as an external, non-learned process, so the central claims do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on the assumption that rendering pipelines yield unbiased ground truth and that DPO can leverage this signal without introducing new biases; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: Differential rendering of native DOCX documents produces precise structural annotations without learned models
    Invoked as the source of ground truth for both dataset creation and DPO positive signals
  • domain assumption: Synthetic LaTeX generation accurately captures right-to-left script layout and component types
    Required for the second pipeline to cover RTL languages reliably
invented entities (1)
  • DocTag format (no independent evidence)
    purpose: Unified encoding of layout, text, and component types
    New annotation schema introduced to standardize output across pipelines

pith-pipeline@v0.9.0 · 5758 in / 1428 out tokens · 45570 ms · 2026-05-22T09:47:55.663777+00:00 · methodology
