MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

Chitta Baral; Jainil Trivedi; Prasham Titiya; Vivek Gupta

arxiv: 2505.21771 · v2 · pith:ZM2XLIYZnew · submitted 2025-05-27 · 💻 cs.CV · cs.AI

keywords: multimodal, mmtabreal, real-world, reasoning, tables, benchmark, evaluation, models

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet remain difficult for Multimodal Large Language Models (MLLMs). Despite advances in text and image understanding, systematic evaluation of table-centric multimodal reasoning is limited. We introduce MMTABREAL, a MultiModal Table Benchmark: a human-curated suite of 500 real-world tables paired with 4,021 question-answer pairs. MMTABREAL spans four question types, five reasoning categories, and eight structural archetypes. Evaluations of state-of-the-art models reveal substantial gaps, especially in visual grounding, spatial alignment, and multi-step inference, with 20-40% performance drops relative to existing benchmarks. These results highlight the need for architectures that more tightly fuse vision with tabular structure and support explicit numeric/logical operations. MMTABREAL is released for evaluation only, providing a rigorous, reproducible testbed that reflects the linguistic, structural, and reasoning complexity of real-world multimodal tables.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity
cs.CL 2026-05 unverdicted novelty 7.0

TableVista benchmark finds foundation models maintain performance across visual styles but degrade sharply on complex table structures and vision-only settings.

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
cs.CV 2026-05 unverdicted novelty 7.0

VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
cs.CV 2026-05 unverdicted novelty 7.0

VT-Bench aggregates 14 datasets totaling over 756K samples across 9 domains and evaluates 23 models to establish a unified testbed for visual-tabular multi-modal discriminative and generative tasks.

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
cs.AI 2026-04 unverdicted novelty 6.0

V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.