pith. machine review for the scientific record.

arxiv: 2604.03660 · v1 · submitted 2026-04-04 · 💻 cs.AI

Recognition: 2 theorem links


TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords spatial grounding · hierarchical tables · perception bottleneck · multimodal reasoning · TableVision · document understanding · MLLMs

The pith

Explicit spatial constraints recover the reasoning potential of multimodal models on complex hierarchical tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a perception bottleneck in multimodal large language models when they process complex tables with hierarchical layouts. As task complexity grows, the number of discrete visual regions involved increases disproportionately, causing an internal perceptual overload that disrupts accurate spatial attention during reasoning. To test this, the authors build TableVision, a benchmark of 6,799 trajectories that explicitly link logical steps to pixel-perfect spatial locations across perception, reasoning, and analysis tasks. Their experiments show that supplying these explicit spatial constraints restores model performance, and a two-stage decoupled framework delivers a 12.3% accuracy gain on the test set.
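To make the mechanism concrete, here is a minimal Python sketch of a two-stage decoupled pipeline of the kind described: Stage 1 only localizes the regions a question depends on, and Stage 2 answers with those regions spelled out. The `call_mllm` handle, `parse_boxes` helper, prompt wording, and box format are illustrative assumptions, not the authors' code.

```python
import re

def call_mllm(image, prompt: str) -> str:
    """Placeholder for whatever MLLM inference API is in use."""
    raise NotImplementedError

def parse_boxes(text: str) -> list[tuple[int, ...]]:
    # Pull "[x1, y1, x2, y2]" patterns out of the model's reply.
    return [tuple(map(int, m))
            for m in re.findall(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", text)]

def stage1_ground(image, question: str) -> list[tuple[int, ...]]:
    # Stage 1 asks only for the pixel boxes of the regions the question
    # touches; no answer is requested at this stage.
    prompt = ("List the bounding boxes [x1, y1, x2, y2] of every table "
              f"region needed to answer: {question}")
    return parse_boxes(call_mllm(image, prompt))

def stage2_reason(image, question: str, boxes) -> str:
    # Stage 2 re-prompts with the boxes spelled out, so spatial attention is
    # anchored explicitly instead of resolved implicitly mid-generation.
    regions = "; ".join(f"region {i}: {list(b)}" for i, b in enumerate(boxes))
    return call_mllm(image, f"Using only these regions ({regions}), answer: {question}")

def answer(image, question: str) -> str:
    return stage2_reason(image, question, stage1_ground(image, question))
```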

Core claim

MLLMs suffer from an internal "Perceptual Overload" on complex hierarchical tables because the number of involved discrete visual regions scales disproportionately with task complexity, impairing accurate spatial attention during implicit generation. Supplying explicit spatial constraints from a rendering-based deterministic grounding pipeline, which couples multi-step logical deductions with pixel-perfect spatial ground truths, recovers this reasoning potential, as shown by diagnostic probing and by a two-stage decoupled framework that achieves a 12.3% overall accuracy improvement on the TableVision test set.

What carries the argument

The rendering-based deterministic grounding pipeline that couples multi-step logical deductions with pixel-perfect spatial ground truths across 6,799 trajectories in the TableVision benchmark.
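The "deterministic" part is worth unpacking: when the benchmark renders its own tables, every cell's pixel box follows from the layout rather than from a noisy detector. A minimal sketch under assumed layout parameters; `CELL_W`, `CELL_H`, and the `GroundedStep` record are invented here for illustration.

```python
# Illustrative sketch (not the authors' pipeline): with self-rendered tables,
# each logical step can be paired with an exact ground-truth region.

from dataclasses import dataclass

CELL_W, CELL_H = 120, 32  # fixed layout parameters, chosen for illustration

@dataclass
class GroundedStep:
    rationale: str                   # one logical deduction in the chain
    bbox: tuple[int, int, int, int]  # pixel-perfect (x1, y1, x2, y2)

def cell_bbox(row: int, col: int) -> tuple[int, int, int, int]:
    # Deterministic coordinate mapping: no detector, no annotation noise.
    x1, y1 = col * CELL_W, row * CELL_H
    return (x1, y1, x1 + CELL_W, y1 + CELL_H)

def ground_trajectory(steps: list[tuple[str, int, int]]) -> list[GroundedStep]:
    # Each step names the (row, col) it reads; rendering geometry supplies the box.
    return [GroundedStep(text, cell_bbox(r, c)) for text, r, c in steps]

trajectory = ground_trajectory([
    ("Locate the 'Q3 Revenue' column header", 0, 2),
    ("Read the value for the 'EMEA' row", 3, 2),
])
```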

If this is right

  • Explicit spatial constraints significantly recover MLLM reasoning performance on hierarchical tables.
  • The two-stage decoupled framework delivers a robust 12.3% accuracy improvement on the test set.
  • Diagnostic probing can isolate the contribution of spatial attention to overall gains.
  • Tasks stratified into Perception, Reasoning, and Analysis levels allow finer evaluation of model weaknesses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perceptual-overload pattern may appear in other structured document types such as charts or forms.
  • Future architectures could embed spatial grounding internally instead of relying on an external rendering pipeline.
  • The benchmark's trajectory format could support training regimes that jointly optimize perception and logic.

Load-bearing premise

The rendering-based deterministic grounding pipeline produces unbiased, pixel-perfect spatial ground truths, and the measured accuracy gains are caused by the spatial constraints rather than by other differences in prompting or training.

What would settle it

A controlled test in which the same models receive identical spatial information but show no accuracy improvement would falsify the claim that explicit spatial constraints are what recovers reasoning potential.
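A sketch of what that isolating control could look like in code, holding the model, items, and prompt template fixed while toggling only the spatial hint. The `model` callable, dataset fields, and exact-match scoring are hypothetical placeholders.

```python
def build_prompt(question: str, boxes, with_grounding: bool) -> str:
    # Identical template in both arms; only the spatial hint is toggled.
    if with_grounding and boxes:
        hint = "; ".join(str(b) for b in boxes)
        return f"Relevant regions: {hint}\nQuestion: {question}"
    return f"Question: {question}"

def run_condition(model, dataset, with_grounding: bool) -> float:
    correct = 0
    for item in dataset:  # items assumed to expose .image, .question, .boxes, .answer
        prompt = build_prompt(item.question, item.boxes, with_grounding)
        correct += model(item.image, prompt).strip() == item.answer
    return correct / len(dataset)

# The claim predicts a positive gap; no gap under identical spatial
# information would falsify it:
# gap = run_condition(m, test_set, True) - run_condition(m, test_set, False)
```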

Figures

Figures reproduced from arXiv: 2604.03660 by Hanqing Wang, Hui Xiong, Junyong Lin, Lu Dai, Wenbin Dai, Xiaoyu Chen, Yanzong Zheng, Zhenggang Xia, Zhuoyu Li.

Figure 1: The Motivation and Insight of TableVision. (I) Motivation: …
Figure 2: Overview of the TableVision benchmark and the proposed grounding …
Figure 3: The TableVision Annotation Pipeline. The framework integrates deterministic rendering for coordinate mapping, LLM-based rationale generation (CoT), and a human-in-the-loop verification loop (Modify/Drop) to ensure high-fidelity spatial-logical alignment. Step 3: Deterministic Alignment and Decoupling. In the final stage, the pipeline executes a deterministic matching algorithm to transform semantic tags i…
Figure 4: The SFT training pipeline of our framework. We fine-tune LoRA adapters on a frozen Qwen3-VL-8B-Instruct backbone. The process is decoupled into Explanatory Structural Localization and Grounding-Conditioned Reasoning. Input Formulation and Tokenization. The model acts as a structural parser guided by a system prompt. The prompt defines five semantic label types (column, row, cell, colhead, and rowhead) to …
Figure 5: Quantitative analysis of Stage-1 spatial grounding and its correlation with downstream S2 Pipeline accuracy. Stage-1 grounding achieves a median IoU of 0.672, with 61.8% of predicted boxes exceeding the 0.5 IoU threshold. However, performance drops significantly as the precision requirement increases, with only 12.2% of boxes achieving IoU ≥ 0.9. This confirms that high-precision localization in dense tables remains the…
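Two short sketches follow from the captions above. First, the Figure 4 setup, LoRA adapters on a frozen backbone, in rough outline; the loader class, model identifier string, target modules, and hyperparameters are guesses, not values from the paper.

```python
# A minimal sketch of LoRA-adapter fine-tuning on a frozen backbone, in the
# spirit of Figure 4. All hyperparameters here are illustrative assumptions.

from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
for p in base.parameters():
    p.requires_grad = False  # backbone stays frozen; only adapters train

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```

Second, the Figure 5 statistics (median IoU, fraction of boxes above the 0.5 and 0.9 thresholds) reduce to a per-box IoU computation; boxes are assumed to be (x1, y1, x2, y2) pixel tuples.

```python
import statistics

def box_area(b) -> int:
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def iou(a, b) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def grounding_report(pred, gold) -> dict:
    scores = [iou(p, g) for p, g in zip(pred, gold)]
    return {
        "median_iou": statistics.median(scores),
        "frac_iou_ge_0.5": sum(s >= 0.5 for s in scores) / len(scores),
        "frac_iou_ge_0.9": sum(s >= 0.9 for s in scores) / len(scores),
    }
```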
read the original abstract

Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TableVision, a large-scale benchmark for spatially grounded reasoning over complex hierarchical tables. It identifies a Perception Bottleneck in MLLMs where increasing task complexity leads to disproportionate growth in discrete visual regions and internal perceptual overload during implicit generation. The work constructs 6,799 high-fidelity reasoning trajectories across three cognitive levels (Perception, Reasoning, Analysis) and 13 sub-categories using a rendering-based deterministic grounding pipeline that couples multi-step deductions with pixel-perfect spatial ground truths. It further proposes a two-stage decoupled framework whose empirical results, supported by diagnostic probing, show that explicit spatial constraints recover MLLM reasoning potential, yielding a 12.3% overall accuracy improvement on the test set.

Significance. If the central empirical claims hold after addressing controls, this would represent a meaningful contribution to multimodal document understanding by supplying a trajectory-aware benchmark that explicitly links perception and logic, and by quantifying how spatial grounding can mitigate perceptual overload in MLLMs. The scale of the dataset and the diagnostic identification of the bottleneck are clear strengths that could serve as a testbed for future work on hierarchical table reasoning.

major comments (3)
  1. Experimental evaluation: The 12.3% accuracy improvement is attributed to explicit spatial constraints within the two-stage decoupled framework, yet no ablation is described that holds the two-stage architecture, prompting, and training fixed while varying only the presence of spatial grounding; without this isolating control the causal attribution remains unsecured.
  2. Benchmark construction (rendering pipeline): The claim that the rendering-based deterministic grounding pipeline yields unbiased pixel-perfect spatial ground truths is load-bearing for the entire benchmark, but the manuscript provides no validation, error analysis, or comparison against alternative grounding methods to confirm absence of rendering artifacts or bias.
  3. Perception Bottleneck analysis: The quantitative demonstration that the number of involved discrete visual regions increases disproportionately with task complexity, leading to perceptual overload, lacks the specific metrics, scaling plots, or statistical characterization needed to make the bottleneck identification fully reproducible and load-bearing for the subsequent framework design.
minor comments (2)
  1. Abstract: The mention of 'diagnostic probing' supporting the results would be strengthened by a brief indication of the probing techniques or key findings.
  2. Overall presentation: Ensure all reported accuracy figures are accompanied by error bars, number of runs, and statistical significance tests to meet standard empirical reporting expectations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to incorporating the suggested controls, validations, and expansions in the revised manuscript to strengthen the empirical claims and reproducibility.

read point-by-point responses
  1. Referee: Experimental evaluation: The 12.3% accuracy improvement is attributed to explicit spatial constraints within the two-stage decoupled framework, yet no ablation is described that holds the two-stage architecture, prompting, and training fixed while varying only the presence of spatial grounding; without this isolating control the causal attribution remains unsecured.

    Authors: We agree that an isolating ablation is essential to secure causal attribution. In the revised manuscript we will add a dedicated ablation study that fixes the two-stage architecture, prompting templates, and training procedure while varying only the presence or absence of explicit spatial grounding. This will directly quantify the incremental contribution of spatial constraints to the reported 12.3% accuracy gain and will be presented alongside the existing results. revision: yes

  2. Referee: Benchmark construction (rendering pipeline): The claim that the rendering-based deterministic grounding pipeline yields unbiased pixel-perfect spatial ground truths is load-bearing for the entire benchmark, but the manuscript provides no validation, error analysis, or comparison against alternative grounding methods to confirm absence of rendering artifacts or bias.

    Authors: We acknowledge the need for explicit validation of the grounding pipeline. We will add a new subsection that reports error analysis on a randomly sampled subset of 200 tables, comparing the rendering-derived spatial ground truths against independent human annotations. We will also include quantitative metrics (e.g., pixel-level IoU and bounding-box precision) and a brief discussion of potential rendering artifacts, thereby substantiating the claim of pixel-perfect grounding. revision: yes

  3. Referee: Perception Bottleneck analysis: The quantitative demonstration that the number of involved discrete visual regions increases disproportionately with task complexity, leading to perceptual overload, lacks the specific metrics, scaling plots, or statistical characterization needed to make the bottleneck identification fully reproducible and load-bearing for the subsequent framework design.

    Authors: We agree that additional quantitative detail is required for reproducibility. In the revised manuscript we will expand the Perception Bottleneck section with (i) per-complexity-level statistics on the number of discrete visual regions, (ii) scaling plots that visualize region growth against task complexity, and (iii) statistical measures including Pearson correlation and regression slopes. These additions will make the bottleneck analysis fully reproducible and will directly motivate the design of the decoupled framework. revision: yes
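A sketch of the scaling analysis promised here: Pearson correlation and a least-squares slope between task complexity and region count. The data values below are illustrative placeholders, not measurements from the paper.

```python
import statistics

def pearson(xs, ys) -> float:
    # Pearson correlation between two equal-length samples.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def slope(xs, ys) -> float:
    # Least-squares regression slope of ys on xs.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

complexity = [1, 1, 2, 2, 3, 3]    # e.g., number of reasoning hops (placeholder)
regions    = [2, 3, 6, 7, 13, 15]  # discrete visual regions touched (placeholder)

r = pearson(complexity, regions)  # disproportionate growth shows up as r near 1
b = slope(complexity, regions)    # together with a steep regression slope
```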

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or empirical results

full rationale

The paper introduces TableVision as a new benchmark with a rendering-based deterministic grounding pipeline and reports an observed 12.3% accuracy lift from a two-stage decoupled framework on its test set. No equations, fitted parameters, or derivations reduce any claimed result to its inputs by construction. The central claims rest on empirical measurements and diagnostic probing rather than self-referential definitions or self-citation chains that force the outcome. Self-evaluation on a newly constructed dataset introduces no circularity under the specified patterns, as the pipeline is described as independently verifiable and the accuracy gains are presented as measured outcomes rather than renamed fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the existence of a perception bottleneck in MLLMs for hierarchical tables and on the assumption that explicit spatial grounding directly mitigates it; the benchmark itself supplies the evaluation substrate.

axioms (1)
  • domain assumption: MLLMs experience perceptual overload proportional to the number of discrete visual regions in complex tables
    Identified via quantitative analysis described in the abstract.
invented entities (1)
  • Perception Bottleneck: no independent evidence
    purpose: Explains the scaling failure of MLLMs on hierarchical tables
    Introduced as the core diagnostic finding from the paper's analysis.

pith-pipeline@v0.9.0 · 5559 in / 1275 out tokens · 49120 ms · 2026-05-13T17:17:06.768845+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 12 internal anchors

  1. [1] Achiam, O.J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., de Almeida, D.M., Altenschmidt, J., Altman, S., Anadkat, S., et al.: …
  2. [2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv abs/2502.13923 (2025)
  4. [4] Cao, B., Lu, H., Ma, C., Wang, T., Li, R., Fan, J.: Orthogonal hierarchical decomposition for structure-aware table understanding with large language models (2026)
  5. [5] Cao, L., Liu, H.: TableMaster: A recipe to advance table understanding with language models. arXiv preprint arXiv:2501.19378 (2025)
  6. [6] Cao, Y., Chen, S., Liu, R., Wang, Z., Fried, D.: API-assisted code generation for question answering on varied table structures. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14536–14548 (2023)
  7. [8] Chen, W., Chang, M.W., Schlinger, E., Wang, W.Y., Cohen, W.W.: Open question answering over tables and text. arXiv abs/2010.10439 (2020)
  8. [9] Chen, W., Wang, H., Chen, J., Zhang, Y., Wang, H., Li, S., Zhou, X., Wang, W.Y.: TabFact: A large-scale dataset for table-based fact verification. arXiv abs/1909.02164 (2019)
  9. [10] Chen, W., et al.: HybridQA: A dataset of multi-hop question answering over tabular and textual data. In: Proceedings of the ACL (2020)
  10. [11] Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: SpatialRGPT: Grounded spatial reasoning in vision language model. arXiv abs/2406.01584 (2024)
  11. [12] Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., Bai, Y., Tang, J., Wang, H., Huang, M.: Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800 (2025)
  12. [13] Cheng, Z., Dong, H., Wang, Z., Jia, R., Guo, J., Gao, Y., Han, S., Lou, J.G., Zhang, D.: HiTab: A hierarchical table dataset for question answering and natural language generation. In: Annual Meeting of the Association for Computational Linguistics (2021)
  13. [14] Cheng, Z., Dong, H., Wang, Z., Jia, R., Guo, J., Gao, Y., Han, S., Lou, J.G., Zhang, D.: HiTab: A hierarchical table dataset for question answering and natural language generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1094–1…
  14. [15] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B.A., Fung, P., Hoi, S.C.H.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv abs/2305.06500 (2023)
  15. [16] Dao, D.H., Huynh, N.T., Tran, K.Q., Nguyen, K.V.: Open-ViTabQA: A novel benchmark for Vietnamese question answering on open domain Wikipedia table. Knowledge-Based Systems 330, 114391 (2025). https://doi.org/10.1016/j.knosys.2025.114391
  16. [17] DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J.M., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)
  17. [18] Fu, X., Liu, M., Yang, Z., Corring, J., Lu, Y., Yang, J., Roth, D., Florencio, D., Zhang, C.: ReFocus: Visual editing as a chain of thought for structured image understanding. arXiv preprint arXiv:2501.05452 (2025)
  18. [19] Guo, Z., Xu, R., Yao, Y., Cui, J., Ni, Z., Ge, C., Chua, T.S., Liu, Z., Huang, G.: LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. In: European Conference on Computer Vision, pp. 390–406. Springer (2024)
  19. [20] Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14953–14962 (2023)
  20. [21] Gupta, V., Mehta, M., Nokhiz, P., Srikumar, V.: InfoTabs: Inference on tables as semi-structured data. In: Annual Meeting of the Association for Computational Linguistics (2020)
  21. [22] Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: TaPas: Weakly supervised table parsing via pre-training. In: Annual Meeting of the Association for Computational Linguistics (2020)
  22. [23] Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025)
  23. [24] Hu, Y., Hua, H., Yang, Z., Shi, W., Smith, N.A., Luo, J.: PromptCap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699 (2022)
  24. [25] Janner, M., Narasimhan, K., Barzilay, R.: Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics 6, 49–61 (2017)
  25. [26] Kang, W., Kuen, J., Ren, M., Wei, Z., Yan, Y., Liu, K.: VGent: Visual grounding via modular design for disentangling reasoning and prediction. arXiv preprint arXiv:2512.11099 (2025)
  26. [27] Kang, X., Wang, Z., Jin, X., Wang, W., Huang, K., Wang, Q.: Template-driven LLM-paraphrased framework for tabular math word problem generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 24303–24311 (2025)
  27. [28] Kang, X., Wu, S., Wang, Z., Liu, Y., Jin, X., Huang, K., Wang, W., Yue, Y., Huang, X., Wang, Q.: Can GRPO boost complex multimodal table understanding? In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12642–12655 (2025)
  28. [29] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-free document understanding transformer. In: European Conference on Computer Vision (2021)
  29. [30] Kim, Y., Yim, M., Song, K.Y.: TableVQA-Bench: A visual question answering benchmark on multiple table domains. arXiv abs/2404.19205 (2024)
  30. [31] Liu, F., Eisenschlos, J., Piccinno, F., Krichene, S., Pang, C., Lee, K., Joshi, M., Chen, W., Collier, N., Altun, Y.: DePlot: One-shot visual language reasoning by plot-to-table translation. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 10381–10399 (2023)
  31. [32] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
  32. [33] Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: LLaVA-Plus: Learning to use tools for creating multimodal agents. In: European Conference on Computer Vision, pp. 126–142. Springer (2024)
  33. [34] Liu, S., Zhang, Z., Hu, P., Ma, J., Du, J., Wang, Q., Zhang, J., Liu, C.: See then tell: Enhancing key information extraction with vision grounding. Neurocomputing, 132858 (2026)
  34. [35] Lompo, B.A., Haraoui, M.: Visual-TableQA: Open-domain benchmark for reasoning over table images. CoRR abs/2509.07966 (2025). https://doi.org/10.48550/arXiv.2509.07966
  35. [36] Mallis, D., Karadeniz, A.S., Cavada, S., Rukhovich, D., Foteinopoulou, N., Cherenkova, K., Kacem, A., Aouada, D.: CAD-Assistant: Tool-augmented VLLMs as generic CAD task solvers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7284–7294 (2025)
  36. [37] Nassar, A.S., Livathinos, N., Lysak, M., Staar, P.W.J.: TableFormer: Table structure understanding with transformers. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4604–4613 (2022)
  37. [38] Parikh, A., Wang, X., Gehrmann, S., Faruqui, M., Dhingra, B., Yang, D., Das, D.: ToTTo: A controlled table-to-text generation dataset. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1173–1186 (2020)
  38. [39] Pasupat, P., Liang, P.: Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470–1480 (2015)
  39. [40] Shi, H., Xie, Y., Goncalves, L., Gao, S., Zhao, J.: WikiDT: Visual-based table recognition and question answering dataset. In: Document Analysis and Recognition, ICDAR 2024, Athens, Greece, Proceedings, Part I. Lecture Notes in Computer Science, vol. 14804, pp. 406–437. Springer (2024)
  40. [41] Singh, A., Chaudhary, R., Singh, G., Kumary, A.: Lost in translation and noise: A deep dive into the failure modes of VLMs on real-world tables. arXiv preprint arXiv:2511.17238 (2025)
  41. [42] Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025)
  42. [43] Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14398–14409 (2024)
  43. [44] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)
  44. [45] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
  45. [46] Wu, J., Yang, L., Li, D., Ji, Y., Okumura, M., Zhang, Y.: MMQA: Evaluating LLMs with multi-table multi-hop complex questions. In: International Conference on Learning Representations (ICLR) 2025, oral. https://openreview.net/forum?id=GGlpykXDCa
  46. [47] Wu, P., Yang, Y., Zhu, G., Ye, C., Gu, H., Lu, X., Xiao, R., Bao, B., He, Y., Zha, L., et al.: RealHiTBench: A comprehensive realistic hierarchical table benchmark for evaluating LLM-based table analysis. In: Findings of the Association for Computational Linguistics: ACL 2025, pp. 7105–7137 (2025)
  47. [48] Wu, X., Yang, J., Chai, L., Zhang, G., Liu, J., Du, X., Liang, D., Shu, D., Cheng, X., Sun, T., Niu, G., Li, T., Li, Z.: TableBench: A comprehensive and complex benchmark for table question answering. arXiv abs/2408.09174 (2024)
  48. [49] Xing, J., He, Y., Zhou, M., Dong, H., Han, S., Chen, L., Zhang, D., Chaudhuri, S., Jagadish, H.V.: MMTU: A massive multi-task table understanding and reasoning benchmark. arXiv abs/2506.05587 (2025)
  49. [50] Xu, P., Wang, S., Zhu, Y., Li, J., Zhang, Y.: SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv abs/2511.21471 (2025)
  50. [51] Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let's think only with images. arXiv preprint arXiv:2505.11409 (2025)
  51. [52] Yang, Y., Patel, A., Deitke, M., Gupta, T., Weihs, L., Head, A., Yatskar, M., Callison-Burch, C., Krishna, R., Kembhavi, A., et al.: Scaling text-rich image understanding via code-guided synthetic multimodal data generation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 17486–1…
  52. [53] Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800 (2024)
  53. [54] Yutong, G., Wang, W., Wu, Y., Miao, Z., Wang, H.: TALENT: Table VQA via augmented language-enhanced natural-text transcription. arXiv preprint arXiv:2510.07098 (2025)
  54. [55] Zhang, J., Pan, C., Wei, K., Xiong, S., Zhao, Y., Li, X., Peng, J., Gu, X., Yang, J., Chang, W., Wu, Z., Zhong, J., Song, S., Li, Y., Li, X.: T2R-Bench: A benchmark for generating article-level reports from real world industrial tables. arXiv abs/2508.19813 (2025)
  55. [56] Zhao, W., Feng, H., Liu, Q., Tang, J., Wei, S., Wu, B., Liao, L., Ye, Y., Liu, H., Li, H., Huang, C.: TabPedia: Towards comprehensive visual table understanding with concept synergy. CoRR abs/2406.01326 (2024). https://doi.org/10.48550/arXiv.2406.01326
  56. [57] Zhao, W., Liu, Y., Wan, Y., Wang, Y., Deng, Z., Yu, P.S.: Localize, retrieve and fuse: A generalized framework for free-form question answering over tables. In: Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), pp. 1–12 (2023)
  57. [58] Zheng, M., Feng, X., Si, Q., She, Q., Lin, Z., Jiang, W., Wang, W.: Multimodal table understanding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9102–9124 (2024)
  58. [59] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)
  59. [60] Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: Data, model, and evaluation. In: European Conference on Computer Vision, pp. 564–580. Springer (2020)
  60. [61] Zhong, X., Shafieibavani, E., Jimeno-Yepes, A.: Image-based table recognition: Data, model, and evaluation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)
  61. [62] Zhu, F., et al.: TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In: Proceedings of the ACL (2021)
  62. [63] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)