Texthawk: Exploring efficient fine- grained perception of multimodal large language models

Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng · 2024 · arXiv 2404.09204

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

representative citing papers

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

cs.CV · 2026-03-20 · unverdicted · novelty 7.0

A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

cs.IR · 2026-04-08 · unverdicted · novelty 6.0

ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

cs.AI · 2026-03-19 · unverdicted · novelty 6.0

MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

MetaRA applies metamorphic testing to VQA tasks and shows that MLLM models exhibit sensitivity to linguistic perturbations and superficial visual cues not detected by conventional accuracy benchmarks.

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

cs.CV · 2024-10-08 · unverdicted · novelty 5.0

PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

cs.CV · 2025-07-14 · unverdicted · novelty 3.0

A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.

citing papers explorer

Showing 6 of 6 citing papers.

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models cs.CV · 2026-03-20 · unverdicted · none · ref 45
A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment cs.IR · 2026-04-08 · unverdicted · none · ref 95
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding cs.AI · 2026-03-19 · unverdicted · none · ref 27
MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems cs.CV · 2026-05-19 · unverdicted · none · ref 39
MetaRA applies metamorphic testing to VQA tasks and shows that MLLM models exhibit sensitivity to linguistic perturbations and superficial visual cues not detected by conventional accuracy benchmarks.
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling cs.CV · 2024-10-08 · unverdicted · none · ref 62
PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends cs.CV · 2025-07-14 · unverdicted · none · ref 68
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.

Texthawk: Exploring efficient fine- grained perception of multimodal large language models

fields

years

verdicts

representative citing papers

citing papers explorer