A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
3 Pith papers cite this work. Polarity classification is still indexing.
abstract
Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG). This survey provides a comprehensive and timely review of document parsing research. We propose a systematic taxonomy that organizes existing approaches into modular pipeline-based systems and unified models driven by Vision-Language Models (VLMs). We provide a detailed review of key components in pipeline systems, including layout analysis and the recognition of heterogeneous content such as text, tables, mathematical expressions, and visual elements, and then systematically track the evolution of specialized VLMs for document parsing. Additionally, we summarize widely adopted evaluation metrics and high-quality benchmarks that establish current standards for parsing quality. Finally, we discuss key open challenges, including robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency, and outline directions for building more accurate and scalable document intelligence systems.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.
citing papers explorer
-
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
-
CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
-
RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering
RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.