LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
citing papers explorer
- Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions
  CoExVQA uses a chain-of-explanation to ground DocVQA answers in localized document regions, achieving state-of-the-art explainable performance with a 12% ANLS gain on PFL-DocVQA over prior baselines.
- Can MLLMs "Read" What is Missing?
  MMTR-Bench shows that current MLLMs face significant difficulty reconstructing masked text from visual context, especially at sentence and paragraph lengths.
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
  MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.