OCRBench v2 is a new benchmark with four times as many tasks as prior versions. It reveals that most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
5 Pith papers cite this work.
citation-role summary
roles: background 2
citation-polarity summary
polarities: background 2
representative citing papers
Fourier Compressor uses FFT to remove frequency-domain redundancy from visual tokens in VLMs, retaining over 96% accuracy with up to 83.8% FLOP reduction.
Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.
citing papers explorer
- Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends