PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

· 2024 · cs.CV · arXiv 2410.05970

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially for academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) that is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of English and Chinese academic papers. Multiple strategies are proposed to build high-quality 1.1 million QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal document understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.

citation-role summary

method 1 other 1

citation-polarity summary

baseline 1 unclear 1

representative citing papers

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

cs.AI · 2026-04-14 · unverdicted · novelty 5.0 · 2 refs

DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

cs.CV · 2025-07-14 · unverdicted · novelty 3.0

A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

cs.MM · 2024-10-28 · unverdicted · novelty 3.0

Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.

citing papers explorer

Showing 3 of 3 citing papers.

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding cs.AI · 2026-04-14 · unverdicted · none · ref 20 · 2 links · internal anchor
DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends cs.CV · 2025-07-14 · unverdicted · none · ref 62 · internal anchor
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction cs.MM · 2024-10-28 · unverdicted · none · ref 267 · internal anchor
Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer