Prefix-Adaptive Block Diffusion for Efficient Document Recognition

Chenyu Liu; Dingwei Zhu; Jiazheng Zhang; Jihua Kang; Jun Long; Kaidi Zhang; Mingxu Chai; Qi Zhang; Ruoyu Chen; Tao Gui

arxiv: 2605.16861 · v1 · pith:W56OMABInew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

Mingxu Chai , Ziyu Shen , Chenyu Liu , Kaidi Zhang , Jiazheng Zhang , Dingwei Zhu , Zhiheng Xi , Ruoyu Chen

show 4 more authors

Jun Long Jihua Kang Tao Gui Qi Zhang

This is my paper

Pith reviewed 2026-05-19 20:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords block diffusion modelsdocument recognitioncausal denoisingprefix commitmentefficient inferenceKV cachestructural loss

0 comments

The pith

Prefix-Adaptive Block Diffusion replaces fixed block boundaries with causal prefix denoising and dynamic commitment to fix information conflicts in document parsing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that block diffusion models can generate documents more efficiently by changing intra-block denoising from bidirectional to causal prefix-to-suffix order and by treating blocks as flexible ranges instead of rigid units. It introduces Confidence-gated Structural Loss to create stable prefixes during training and Progressive Prefix Commitment to move the longest reliable prefix into the KV cache at inference time, resetting the next range from there. A sympathetic reader would care because current block diffusion approaches lose parallelism and face inconsistent information flow between blocks, which hurts both speed and accuracy on structure-heavy tasks like document recognition. If the changes work, larger parallel decoding spaces become available at each step without sacrificing the ability to handle variable-length outputs.

Core claim

By switching to causal denoising inside blocks and using Progressive Prefix Commitment to dynamically commit reliable prefixes to the cache, the Prefix-Adaptive Block Diffusion Model restores large parallel decoding spaces at every step while maintaining consistent information flow between intra-block and inter-block generation.

What carries the argument

Progressive Prefix Commitment, which identifies the longest reliable prefix via confidence scores, commits it to the KV cache, and resets the next candidate block range from the updated prefix position.

If this is right

Intra-block parallelism no longer shrinks as denoising proceeds because the direction is now strictly prefix to suffix.
Generated tokens enter the KV cache as soon as a reliable prefix is confirmed rather than waiting for an entire block.
The model can sustain larger parallel decoding windows throughout inference instead of progressively restricting them.
Training first builds low-entropy prefixes before extending to longer sequences, which aligns the learned distribution with the causal inference path.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prefix-commitment logic could be tested on other structured generation tasks such as code or table layout where order matters.
If the dynamic range reset works reliably, it may allow training larger models without increasing memory use during inference.
One could measure whether the method reduces the rate of format violations like misplaced table cells compared with fixed-block baselines.

Load-bearing premise

That switching to causal prefix-to-suffix denoising plus the confidence-gated loss and progressive commitment will remove the information-flow conflict without creating new structural errors in recognition.

What would settle it

Running the 3B PA-BDM on a document benchmark and finding either lower recognition accuracy than the 2.5B baseline or no throughput gain when measuring tokens generated per second.

Figures

Figures reproduced from arXiv: 2605.16861 by Chenyu Liu, Dingwei Zhu, Jiazheng Zhang, Jihua Kang, Jun Long, Kaidi Zhang, Mingxu Chai, Qi Zhang, Ruoyu Chen, Tao Gui, Zhiheng Xi, Ziyu Shen.

**Figure 1.** Figure 1: Unlike standard block diffusion models that cache only after completing an entire block, our method treats [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Training and inference of PA-BDM. (a) During training, PA-BDM concatenates noisy and clean sequences, applies causal block attention, and uses CSL to supervise as many masked tokens as allowed by prefix confidence. (b) During inference, PA-BDM treats the block size as a maximum candidate range. PPC selects a committed prefix, materializes its KV states while predicting the next candidate range, and resets … view at source ↗

**Figure 3.** Figure 3: Accuracy–efficiency trade-off of PA-BDM across model scales. The x-axis denotes the PPC confidence threshold. Lines show accuracy, and bars show inference throughput (TPS). Additional hyperparameter studies are provided in the Appendix D. Block Size Formula ↑ Text ↓ Table ↑ Bidir. Causal Bidir. Causal Bidir. Causal 8 76.2 87.1 0.214 0.197 75.3 83.5 16 69.2 78.0 0.223 0.226 61.7 74.2 32 31.4 27.5 0.271 0.2… view at source ↗

**Figure 5.** Figure 5: The red line shows the ACC across different [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: A case study of PA-BDM on mathematical formula recognition using adaptive step-size decoding. The number in the top-left corner of each slot indicates the generation order, while the color intensity within each slot represents the generation time (darker indicates earlier). to improve both structural stability and decoding efficiency. E Batch-parallel PPC Decoding Confidence-based block diffusion decoding … view at source ↗

read the original abstract

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6\% over the 2.5B MinerU-Diffusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PA-BDM swaps bidirectional intra-block denoising for causal prefix-to-suffix flow plus dynamic prefix commitment to lift throughput in document recognition, but the supporting experiments stay light on controls and failure analysis.

read the letter

This paper adapts block diffusion models for document recognition by replacing bidirectional denoising inside blocks with causal prefix-to-suffix processing and treating block size as a flexible maximum range instead of a fixed unit. They add Confidence-gated Structural Loss to train low-entropy prefixes and Progressive Prefix Commitment at inference to commit reliable prefixes early and reset the next range, which they say restores parallelism while cutting the information-flow clash between blocks.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Prefix-Adaptive Block Diffusion Model (PA-BDM) for document recognition tasks. It modifies existing Block Diffusion Models by replacing intra-block bidirectional denoising with causal prefix-to-suffix denoising, treating block size as a maximum candidate range, and adding Confidence-gated Structural Loss (CSL) to construct low-entropy prefixes during training plus Progressive Prefix Commitment (PPC) to dynamically commit reliable prefixes to the KV cache at inference time. The central empirical claim is that a 3B-parameter PA-BDM attains higher recognition scores on several benchmarks while delivering a 71.6% inference throughput improvement relative to the 2.5B MinerU-Diffusion baseline.

Significance. If the reported accuracy gains and throughput improvements prove robust under controlled conditions with proper statistical controls, the work would offer a practical advance in efficient, parallelizable generation for layout-sensitive document parsing. The adaptive commitment strategy provides a concrete mechanism for reconciling intra-block causality with inter-block autoregression, which could influence subsequent diffusion-based structured prediction methods.

major comments (3)

[§3] §3 (Method, description of CSL and PPC): The claim that causal prefix-to-suffix denoising together with CSL and PPC eliminates the original information-flow conflict without introducing new structure errors is load-bearing for the central thesis, yet the manuscript supplies no ablation that isolates the contribution of CSL/PPC to global layout consistency on tables, multi-column text, or hierarchical headings.
[§5] §5 (Experiments): The abstract states that the 3B PA-BDM achieves higher recognition scores and a 71.6% throughput gain over the 2.5B MinerU-Diffusion, but reports neither error bars, dataset cardinalities, nor explicit confirmation that throughput measurements were performed under identical hardware, batch-size, and caching conditions; this absence directly weakens verification of the efficiency claim.
[§3.3] §3.3 (Inference procedure): The Progressive Prefix Commitment step assumes that confidence estimates remain reliable on ambiguous structural elements; no targeted failure-case analysis or comparison against non-causal baselines is provided to test whether premature prefix commitment can lock in incorrect alignments that later tokens cannot recover.

minor comments (2)

The abstract would be clearer if it briefly stated the exact benchmark datasets and the precise definition of throughput (tokens per second, images per second, etc.).
A small diagram illustrating the evolution of the candidate range under PPC would improve readability of the inference algorithm.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will make to improve the manuscript.

read point-by-point responses

Referee: [§3] §3 (Method, description of CSL and PPC): The claim that causal prefix-to-suffix denoising together with CSL and PPC eliminates the original information-flow conflict without introducing new structure errors is load-bearing for the central thesis, yet the manuscript supplies no ablation that isolates the contribution of CSL/PPC to global layout consistency on tables, multi-column text, or hierarchical headings.

Authors: We agree that an ablation isolating the contributions of CSL and PPC would strengthen the evidence for their role in preserving global layout consistency. In the revised manuscript we will add a targeted ablation study comparing variants with and without CSL/PPC, reporting layout-consistency metrics on tables, multi-column text, and hierarchical headings. revision: yes
Referee: [§5] §5 (Experiments): The abstract states that the 3B PA-BDM achieves higher recognition scores and a 71.6% throughput gain over the 2.5B MinerU-Diffusion, but reports neither error bars, dataset cardinalities, nor explicit confirmation that throughput measurements were performed under identical hardware, batch-size, and caching conditions; this absence directly weakens verification of the efficiency claim.

Authors: We acknowledge that the current reporting lacks the statistical and procedural details needed for full verification. We will revise the experiments section to include error bars from multiple runs, report dataset cardinalities explicitly, and provide a clear description of the hardware, batch sizes, and caching conditions used for all throughput measurements. revision: yes
Referee: [§3.3] §3.3 (Inference procedure): The Progressive Prefix Commitment step assumes that confidence estimates remain reliable on ambiguous structural elements; no targeted failure-case analysis or comparison against non-causal baselines is provided to test whether premature prefix commitment can lock in incorrect alignments that later tokens cannot recover.

Authors: We recognize that the reliability of confidence estimates under structural ambiguity is an important open question. We will add a failure-case analysis section together with direct comparisons against non-causal baselines to evaluate whether premature commitment can produce irrecoverable alignment errors. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces PA-BDM by describing architectural changes (causal prefix-to-suffix denoising, CSL, PPC) and reports empirical benchmark results for recognition accuracy and throughput gains. No equations, parameter-fitting steps, or self-citations appear in the provided abstract or text that would reduce any claimed prediction or result to an input quantity by construction. The central claims rest on experimental outcomes rather than a closed derivation chain, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities; the model introduces CSL and PPC as new components whose internal hyperparameters and assumptions remain unspecified.

pith-pipeline@v0.9.0 · 5774 in / 1128 out tokens · 43988 ms · 2026-05-19T20:31:23.947249+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · 1 internal anchor

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Publications Manual , year = "1983", publisher =

work page 1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[5]

Dan Gusfield , title =. 1997

work page 1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[8]

2023 , eprint=

DocTr: Document Transformer for Structured Information Extraction in Documents , author=. 2023 , eprint=

work page 2023
[9]

2024 , eprint=

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition , author=. 2024 , eprint=

work page 2024
[10]

2024 , eprint=

Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition , author=. 2024 , eprint=

work page 2024
[11]

2025 , eprint=

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents , author=. 2025 , eprint=

work page 2025
[12]

2025 , eprint=

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion , author=. 2025 , eprint=

work page 2025
[13]

2024 , eprint=

Retrieval-Augmented Generation for AI-Generated Content: A Survey , author=. 2024 , eprint=

work page 2024
[14]

2024 , eprint=

MinerU: An Open-Source Solution for Precise Document Content Extraction , author=. 2024 , eprint=

work page 2024
[15]

2025 , note =

Vik Paruchuri , title =. 2025 , note =

work page 2025
[16]

2025 , eprint=

Qwen2.5-VL Technical Report , author=. 2025 , eprint=

work page 2025
[17]

2023 , eprint=

Nougat: Neural Optical Understanding for Academic Documents , author=. 2023 , eprint=

work page 2023
[18]

2024 , eprint=

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model , author=. 2024 , eprint=

work page 2024
[19]

2025 , eprint=

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting , author=. 2025 , eprint=

work page 2025
[20]

2025 , eprint=

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing , author=. 2025 , eprint=

work page 2025
[21]

2025 , eprint=

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model , author=. 2025 , eprint=

work page 2025
[22]

2025 , eprint=

DocFusion: A Unified Framework for Document Parsing Tasks , author=. 2025 , eprint=

work page 2025
[23]

2024 , eprint=

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

work page 2024
[24]

2023 , eprint=

Visual Instruction Tuning , author=. 2023 , eprint=

work page 2023
[25]

2021 , eprint=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. 2021 , eprint=

work page 2021
[26]

2021 , eprint=

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , author=. 2021 , eprint=

work page 2021
[27]

2020 , eprint=

Multilingual Denoising Pre-training for Neural Machine Translation , author=. 2020 , eprint=

work page 2020
[28]

2024 , eprint=

Qwen2 Technical Report , author=. 2024 , eprint=

work page 2024
[29]

2024 , eprint=

UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition , author=. 2024 , eprint=

work page 2024
[30]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025
[31]

2023 , eprint=

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , author=. 2023 , eprint=

work page 2023
[32]

2024 , eprint=

Improved Baselines with Visual Instruction Tuning , author=. 2024 , eprint=

work page 2024
[33]

2017 , eprint=

Feature Pyramid Networks for Object Detection , author=. 2017 , eprint=

work page 2017
[34]

2021 , eprint=

Deformable DETR: Deformable Transformers for End-to-End Object Detection , author=. 2021 , eprint=

work page 2021
[35]

2017 , eprint=

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , author=. 2017 , eprint=

work page 2017
[36]

2022 , eprint=

DaViT: Dual Attention Vision Transformers , author=. 2022 , eprint=

work page 2022
[37]

2019 , eprint=

Generating Long Sequences with Sparse Transformers , author=. 2019 , eprint=

work page 2019
[38]

2021 , eprint=

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification , author=. 2021 , eprint=

work page 2021
[39]

2017 , eprint=

Submanifold Sparse Convolutional Networks , author=. 2017 , eprint=

work page 2017
[40]

2021 , eprint=

Learning Transferable Visual Models From Natural Language Supervision , author=. 2021 , eprint=

work page 2021
[41]

2022 , eprint=

Flamingo: a Visual Language Model for Few-Shot Learning , author=. 2022 , eprint=

work page 2022
[42]

2023 , eprint=

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. 2023 , eprint=

work page 2023
[43]

2023 , eprint=

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , author=. 2023 , eprint=

work page 2023
[44]

2024 , eprint=

CogVLM: Visual Expert for Pretrained Language Models , author=. 2024 , eprint=

work page 2024
[45]

2025 , eprint=

Locality Alignment Improves Vision-Language Models , author=. 2025 , eprint=

work page 2025
[46]

2025 , eprint=

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding , author=. 2025 , eprint=

work page 2025
[47]

2023 , eprint=

UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning , author=. 2023 , eprint=

work page 2023
[48]

2024 , eprint=

Attention Prompting on Image for Large Vision-Language Models , author=. 2024 , eprint=

work page 2024
[49]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Cao, Yun-Hao and Ji, Kaixiang and Huang, Ziyuan and Zheng, Chuanyang and Liu, Jiajia and Wang, Jian and Chen, Jingdong and Yang, Ming , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =

work page 2024
[50]

2024 , eprint=

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception , author=. 2024 , eprint=

work page 2024
[51]

2023 , eprint=

Vision Grid Transformer for Document Layout Analysis , author=. 2023 , eprint=

work page 2023
[52]

2025 , eprint=

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm , author=. 2025 , eprint=

work page 2025
[53]

2025 , eprint=

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding , author=. 2025 , eprint=

work page 2025
[54]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , month=. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

work page
[55]

2024 , publisher =

OleehyO , title =. 2024 , publisher =

work page 2024
[56]

arXiv preprint arXiv:1911.10683 , year=

Image-based table recognition: data, model, and evaluation , author=. arXiv preprint arXiv:1911.10683 , year=

work page arXiv 1911
[57]

2020 , eprint=

Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context , author=. 2020 , eprint=

work page 2020
[58]

and Staar, Peter , title =

Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S. and Staar, Peter , year=. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation , url=. doi:10.1145/3534678.3539043 , booktitle=

work page doi:10.1145/3534678.3539043
[59]

2024 , publisher =

Daeun004 , title =. 2024 , publisher =

work page 2024
[60]

2024 , eprint=

CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation , author=. 2024 , eprint=

work page 2024
[61]

, language=

Levenshtein, V.I. , language=. Binary codes capable of correcting deletions, insertions and reversals , journal=

work page
[62]

2020 , eprint=

Image-based table recognition: data, model, and evaluation , author=. 2020 , eprint=

work page 2020
[63]

2025 , eprint=

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations , author=. 2025 , eprint=

work page 2025
[64]

2024 , eprint=

GPT-4o System Card , author=. 2024 , eprint=

work page 2024
[65]

DeepSeek-OCR: Contexts Optical Compression

DeepSeek-OCR: Contexts Optical Compression , author=. arXiv preprint arXiv:2510.18234 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

2025 , eprint=

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction , author=. 2025 , eprint=

work page 2025
[67]

2024 , eprint=

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention , author=. 2024 , eprint=

work page 2024
[68]

Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging , author=

work page
[69]

2025 , eprint=

dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model , author=. 2025 , eprint=

work page 2025
[70]

2019 , eprint=

Brno Mobile OCR Dataset , author=. 2019 , eprint=

work page 2019
[71]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

work page 2025
[72]

2024 , eprint=

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding , author=. 2024 , eprint=

work page 2024
[73]

2025 , eprint=

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification , author=. 2025 , eprint=

work page 2025
[74]

2026 , eprint=

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing , author=. 2026 , eprint=

work page 2026
[75]

2023 , eprint=

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution , author=. 2023 , eprint=

work page 2023
[76]

2023 , booktitle =

Vision Grid Transformer for Document Layout Analysis , author=. 2023 , booktitle =

work page 2023
[77]

2021 , eprint=

DocVQA: A Dataset for VQA on Document Images , author=. 2021 , eprint=

work page 2021
[78]

2022 , eprint=

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning , author=. 2022 , eprint=

work page 2022
[79]

nocaps: novel object captioning at scale , url=

Agrawal, Harsh and Desai, Karan and Wang, Yufei and Chen, Xinlei and Jain, Rishabh and Johnson, Mark and Batra, Dhruv and Parikh, Devi and Lee, Stefan and Anderson, Peter , year=. nocaps: novel object captioning at scale , url=. doi:10.1109/iccv.2019.00904 , booktitle=

work page doi:10.1109/iccv.2019.00904 2019
[80]

2025 , eprint=

PaddleOCR 3.0 Technical Report , author=. 2025 , eprint=

work page 2025

Showing first 80 references.

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972

[2] [2]

Publications Manual , year = "1983", publisher =

work page 1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page

[5] [5]

Dan Gusfield , title =. 1997

work page 1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page

[8] [8]

2023 , eprint=

DocTr: Document Transformer for Structured Information Extraction in Documents , author=. 2023 , eprint=

work page 2023

[9] [9]

2024 , eprint=

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition , author=. 2024 , eprint=

work page 2024

[10] [10]

2024 , eprint=

Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition , author=. 2024 , eprint=

work page 2024

[11] [11]

2025 , eprint=

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents , author=. 2025 , eprint=

work page 2025

[12] [12]

2025 , eprint=

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion , author=. 2025 , eprint=

work page 2025

[13] [13]

2024 , eprint=

Retrieval-Augmented Generation for AI-Generated Content: A Survey , author=. 2024 , eprint=

work page 2024

[14] [14]

2024 , eprint=

MinerU: An Open-Source Solution for Precise Document Content Extraction , author=. 2024 , eprint=

work page 2024

[15] [15]

2025 , note =

Vik Paruchuri , title =. 2025 , note =

work page 2025

[16] [16]

2025 , eprint=

Qwen2.5-VL Technical Report , author=. 2025 , eprint=

work page 2025

[17] [17]

2023 , eprint=

Nougat: Neural Optical Understanding for Academic Documents , author=. 2023 , eprint=

work page 2023

[18] [18]

2024 , eprint=

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model , author=. 2024 , eprint=

work page 2024

[19] [19]

2025 , eprint=

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting , author=. 2025 , eprint=

work page 2025

[20] [20]

2025 , eprint=

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing , author=. 2025 , eprint=

work page 2025

[21] [21]

2025 , eprint=

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model , author=. 2025 , eprint=

work page 2025

[22] [22]

2025 , eprint=

DocFusion: A Unified Framework for Document Parsing Tasks , author=. 2025 , eprint=

work page 2025

[23] [23]

2024 , eprint=

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

work page 2024

[24] [24]

2023 , eprint=

Visual Instruction Tuning , author=. 2023 , eprint=

work page 2023

[25] [25]

2021 , eprint=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. 2021 , eprint=

work page 2021

[26] [26]

2021 , eprint=

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , author=. 2021 , eprint=

work page 2021

[27] [27]

2020 , eprint=

Multilingual Denoising Pre-training for Neural Machine Translation , author=. 2020 , eprint=

work page 2020

[28] [28]

2024 , eprint=

Qwen2 Technical Report , author=. 2024 , eprint=

work page 2024

[29] [29]

2024 , eprint=

UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition , author=. 2024 , eprint=

work page 2024

[30] [30]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[31] [31]

2023 , eprint=

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , author=. 2023 , eprint=

work page 2023

[32] [32]

2024 , eprint=

Improved Baselines with Visual Instruction Tuning , author=. 2024 , eprint=

work page 2024

[33] [33]

2017 , eprint=

Feature Pyramid Networks for Object Detection , author=. 2017 , eprint=

work page 2017

[34] [34]

2021 , eprint=

Deformable DETR: Deformable Transformers for End-to-End Object Detection , author=. 2021 , eprint=

work page 2021

[35] [35]

2017 , eprint=

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , author=. 2017 , eprint=

work page 2017

[36] [36]

2022 , eprint=

DaViT: Dual Attention Vision Transformers , author=. 2022 , eprint=

work page 2022

[37] [37]

2019 , eprint=

Generating Long Sequences with Sparse Transformers , author=. 2019 , eprint=

work page 2019

[38] [38]

2021 , eprint=

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification , author=. 2021 , eprint=

work page 2021

[39] [39]

2017 , eprint=

Submanifold Sparse Convolutional Networks , author=. 2017 , eprint=

work page 2017

[40] [40]

2021 , eprint=

Learning Transferable Visual Models From Natural Language Supervision , author=. 2021 , eprint=

work page 2021

[41] [41]

2022 , eprint=

Flamingo: a Visual Language Model for Few-Shot Learning , author=. 2022 , eprint=

work page 2022

[42] [42]

2023 , eprint=

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. 2023 , eprint=

work page 2023

[43] [43]

2023 , eprint=

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , author=. 2023 , eprint=

work page 2023

[44] [44]

2024 , eprint=

CogVLM: Visual Expert for Pretrained Language Models , author=. 2024 , eprint=

work page 2024

[45] [45]

2025 , eprint=

Locality Alignment Improves Vision-Language Models , author=. 2025 , eprint=

work page 2025

[46] [46]

2025 , eprint=

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding , author=. 2025 , eprint=

work page 2025

[47] [47]

2023 , eprint=

UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning , author=. 2023 , eprint=

work page 2023

[48] [48]

2024 , eprint=

Attention Prompting on Image for Large Vision-Language Models , author=. 2024 , eprint=

work page 2024

[49] [49]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Cao, Yun-Hao and Ji, Kaixiang and Huang, Ziyuan and Zheng, Chuanyang and Liu, Jiajia and Wang, Jian and Chen, Jingdong and Yang, Ming , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =

work page 2024

[50] [50]

2024 , eprint=

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception , author=. 2024 , eprint=

work page 2024

[51] [51]

2023 , eprint=

Vision Grid Transformer for Document Layout Analysis , author=. 2023 , eprint=

work page 2023

[52] [52]

2025 , eprint=

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm , author=. 2025 , eprint=

work page 2025

[53] [53]

2025 , eprint=

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding , author=. 2025 , eprint=

work page 2025

[54] [54]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , month=. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

work page

[55] [55]

2024 , publisher =

OleehyO , title =. 2024 , publisher =

work page 2024

[56] [56]

arXiv preprint arXiv:1911.10683 , year=

Image-based table recognition: data, model, and evaluation , author=. arXiv preprint arXiv:1911.10683 , year=

work page arXiv 1911

[57] [57]

2020 , eprint=

Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context , author=. 2020 , eprint=

work page 2020

[58] [58]

and Staar, Peter , title =

Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S. and Staar, Peter , year=. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation , url=. doi:10.1145/3534678.3539043 , booktitle=

work page doi:10.1145/3534678.3539043

[59] [59]

2024 , publisher =

Daeun004 , title =. 2024 , publisher =

work page 2024

[60] [60]

2024 , eprint=

CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation , author=. 2024 , eprint=

work page 2024

[61] [61]

, language=

Levenshtein, V.I. , language=. Binary codes capable of correcting deletions, insertions and reversals , journal=

work page

[62] [62]

2020 , eprint=

Image-based table recognition: data, model, and evaluation , author=. 2020 , eprint=

work page 2020

[63] [63]

2025 , eprint=

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations , author=. 2025 , eprint=

work page 2025

[64] [64]

2024 , eprint=

GPT-4o System Card , author=. 2024 , eprint=

work page 2024

[65] [65]

DeepSeek-OCR: Contexts Optical Compression

DeepSeek-OCR: Contexts Optical Compression , author=. arXiv preprint arXiv:2510.18234 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[66] [66]

2025 , eprint=

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction , author=. 2025 , eprint=

work page 2025

[67] [67]

2024 , eprint=

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention , author=. 2024 , eprint=

work page 2024

[68] [68]

Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging , author=

work page

[69] [69]

2025 , eprint=

dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model , author=. 2025 , eprint=

work page 2025

[70] [70]

2019 , eprint=

Brno Mobile OCR Dataset , author=. 2019 , eprint=

work page 2019

[71] [71]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

work page 2025

[72] [72]

2024 , eprint=

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding , author=. 2024 , eprint=

work page 2024

[73] [73]

2025 , eprint=

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification , author=. 2025 , eprint=

work page 2025

[74] [74]

2026 , eprint=

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing , author=. 2026 , eprint=

work page 2026

[75] [75]

2023 , eprint=

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution , author=. 2023 , eprint=

work page 2023

[76] [76]

2023 , booktitle =

Vision Grid Transformer for Document Layout Analysis , author=. 2023 , booktitle =

work page 2023

[77] [77]

2021 , eprint=

DocVQA: A Dataset for VQA on Document Images , author=. 2021 , eprint=

work page 2021

[78] [78]

2022 , eprint=

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning , author=. 2022 , eprint=

work page 2022

[79] [79]

nocaps: novel object captioning at scale , url=

Agrawal, Harsh and Desai, Karan and Wang, Yufei and Chen, Xinlei and Jain, Rishabh and Johnson, Mark and Batra, Dhruv and Parikh, Devi and Lee, Stefan and Anderson, Peter , year=. nocaps: novel object captioning at scale , url=. doi:10.1109/iccv.2019.00904 , booktitle=

work page doi:10.1109/iccv.2019.00904 2019

[80] [80]

2025 , eprint=

PaddleOCR 3.0 Technical Report , author=. 2025 , eprint=

work page 2025