Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

· 2024 · cs.MM · arXiv 2410.21169

9 Pith papers cite this work. Polarity classification is still indexing.

abstract

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG). This survey provides a comprehensive and timely review of document parsing research. We propose a systematic taxonomy that organizes existing approaches into modular pipeline-based systems and unified models driven by Vision-Language Models (VLMs). We provide a detailed review of key components in pipeline systems, including layout analysis and the recognition of heterogeneous content such as text, tables, mathematical expressions, and visual elements, and then systematically track the evolution of specialized VLMs for document parsing. Additionally, we summarize widely adopted evaluation metrics and high-quality benchmarks that establish current standards for parsing quality. Finally, we discuss key open challenges, including robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency, and outline directions for building more accurate and scalable document intelligence systems.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

StrucTab: A Structured Optimization Framework for Table Parsing

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

cs.CV · 2025-09-26 · unverdicted · novelty 6.0

MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

cs.CL · 2026-05-23 · unverdicted · novelty 4.0

Unveil proposes a visual-textual embedding model for multi-modal documents that is distilled into an efficient visual-only retriever.

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

cs.AI · 2026-05-16 · conditional · novelty 4.0

MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.

RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

cs.IR · 2026-03-04 · unverdicted · novelty 4.0

RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.

citing papers explorer

Showing 9 of 9 citing papers.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · none · ref 37 · internal anchor

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank

cs.CL · 2026-06-25 · unverdicted · none · ref 68 · internal anchor

LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

cs.CV · 2026-04-06 · unverdicted · none · ref 50 · internal anchor

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

StrucTab: A Structured Optimization Framework for Table Parsing

cs.CV · 2026-06-29 · unverdicted · none · ref 62 · internal anchor

StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

cs.CL · 2026-05-05 · unverdicted · none · ref 23 · internal anchor

CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

cs.CV · 2025-09-26 · unverdicted · none · ref 57 · internal anchor

MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

cs.CL · 2026-05-23 · unverdicted · none · ref 39 · internal anchor

Unveil proposes a visual-textual embedding model for multi-modal documents that is distilled into an efficient visual-only retriever.

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

cs.AI · 2026-05-16 · conditional · none · ref 35 · internal anchor

MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.

RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

cs.IR · 2026-03-04 · unverdicted · none · ref 27 · internal anchor

RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.
