Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding,
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 4roles
background 1polarities
background 1representative citing papers
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
DocVAL transfers spatial reasoning via validated CoT distillation from large teachers to compact student VLMs, delivering up to 6-7 ANLS gains and strong mAP localization on document VQA benchmarks.
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
citing papers explorer
-
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
DocVAL transfers spatial reasoning via validated CoT distillation from large teachers to compact student VLMs, delivering up to 6-7 ANLS gains and strong mAP localization on document VQA benchmarks.
-
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.