Natural Language Engineering, 30(4):870–881
3 Pith papers cite this work.
verdicts: ACCEPT 3
roles: background 1
polarities: background 1
representative citing papers
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
A systematic survey of Multimodal RAG for document understanding proposing a taxonomy based on domain, retrieval modality, and granularity while reviewing graph structures, agentic frameworks, datasets, benchmarks, applications, and open challenges.
citing papers explorer
- CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
  CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
  OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
  A systematic survey of Multimodal RAG for document understanding proposing a taxonomy based on domain, retrieval modality, and granularity while reviewing graph structures, agentic frameworks, datasets, benchmarks, applications, and open challenges.