A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
Texthawk: Exploring efficient fine- grained perception of multimodal large language models
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
MetaRA applies metamorphic testing to VQA tasks and shows that MLLM models exhibit sensitivity to linguistic perturbations and superficial visual cues not detected by conventional accuracy benchmarks.
PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
citing papers explorer
-
From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
-
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
-
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
-
MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems
MetaRA applies metamorphic testing to VQA tasks and shows that MLLM models exhibit sensitivity to linguistic perturbations and superficial visual cues not detected by conventional accuracy benchmarks.
-
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
-
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.