Donut: Document understanding transformer without OCR

Kim, G · 2021 · arXiv 2111.15664

10 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

cs.CV · 2026-05-19 · conditional · novelty 7.0

Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

cs.CV · 2026-03-20 · unverdicted · novelty 7.0

A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription

cs.LG · 2025-02-27 · unverdicted · novelty 6.0

Introduces OCR+PAGE-1 and OCR+PAGE-N prompting strategies that improve zero-shot multi-page handwritten document transcription by sharing context across pages.

Nougat: Neural Optical Understanding for Academic Documents

cs.LG · 2023-08-25 · conditional · novelty 6.0

Nougat applies a visual transformer to convert academic PDFs into markup language while accurately handling mathematical content on a new scientific document dataset.

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting attention latency by 3x.

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

cs.AI · 2026-05-16 · conditional · novelty 4.0

The MADP multi-agent pipeline with a human in the loop achieves 97% full automation on 955 real documents, 98.5% accuracy on an ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.

From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms

cs.CV · 2026-04-14 · unverdicted · novelty 4.0

Frontier multimodal LLMs achieve ~85% accuracy and ~90% weighted F1 on digitizing complex handwritten medical forms, with Gemini 3.1 strongest overall and prompt optimization lifting macro metrics over 60%.

Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

cs.CV · 2026-04-03 · unverdicted · novelty 3.0

MolSeek-OCR reaches exact SMILES matching accuracy comparable to leading image-to-sequence OCSR models after two-stage fine-tuning on PubChem renderings and USPTO-MOL patent images, but remains below image-to-graph state-of-the-art.

citing papers explorer

Showing 10 of 10 citing papers.

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding cs.CV · 2026-05-19 · conditional · none · ref 2
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale cs.CV · 2026-04-06 · unverdicted · none · ref 16
From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models cs.CV · 2026-03-20 · unverdicted · none · ref 21
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 52
Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription cs.LG · 2025-02-27 · unverdicted · none · ref 21
Nougat: Neural Optical Understanding for Academic Documents cs.LG · 2023-08-25 · conditional · none · ref 30
FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing cs.CV · 2026-05-17 · unverdicted · none · ref 11
MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop cs.AI · 2026-05-16 · conditional · none · ref 17
From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms cs.CV · 2026-04-14 · unverdicted · none · ref 5
Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition cs.CV · 2026-04-03 · unverdicted · none · ref 3
