pith. machine review for the scientific record.

arxiv: 2604.03660 · v1 · submitted 2026-04-04 · 💻 cs.AI

Recognition: 2 theorem links


TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords spatial grounding · hierarchical tables · perception bottleneck · multimodal reasoning · TableVision · document understanding · MLLMs

The pith

Explicit spatial constraints recover the reasoning potential of multimodal models on complex hierarchical tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a perception bottleneck in multimodal large language models when they process complex tables with hierarchical layouts. As task complexity grows, the number of discrete visual regions involved increases disproportionately, causing an internal perceptual overload that disrupts accurate spatial attention during reasoning. To test this, the authors build TableVision, a benchmark of 6,799 trajectories that explicitly link logical steps to pixel-perfect spatial locations across perception, reasoning, and analysis tasks. Their experiments show that supplying these explicit spatial constraints restores model performance, and a two-stage decoupled framework delivers a 12.3% accuracy gain on the test set.
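To make the mechanism concrete, here is a minimal Python sketch of a two-stage decoupled pipeline of the kind described: Stage 1 only localizes the regions a question depends on, and Stage 2 answers with those regions spelled out. The `call_mllm` handle, `parse_boxes` helper, prompt wording, and box format are illustrative assumptions, not the authors' code.

```python
import re

def call_mllm(image, prompt: str) -> str:
    """Placeholder for whatever MLLM inference API is in use."""
    raise NotImplementedError

def parse_boxes(text: str) -> list[tuple[int, ...]]:
    # Pull "[x1, y1, x2, y2]" patterns out of the model's reply.
    return [tuple(map(int, m))
            for m in re.findall(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", text)]

def stage1_ground(image, question: str) -> list[tuple[int, ...]]:
    # Stage 1 asks only for the pixel boxes of the regions the question
    # touches; no answer is requested at this stage.
    prompt = ("List the bounding boxes [x1, y1, x2, y2] of every table "
              f"region needed to answer: {question}")
    return parse_boxes(call_mllm(image, prompt))

def stage2_reason(image, question: str, boxes) -> str:
    # Stage 2 re-prompts with the boxes spelled out, so spatial attention is
    # anchored explicitly instead of resolved implicitly mid-generation.
    regions = "; ".join(f"region {i}: {list(b)}" for i, b in enumerate(boxes))
    return call_mllm(image, f"Using only these regions ({regions}), answer: {question}")

def answer(image, question: str) -> str:
    return stage2_reason(image, question, stage1_ground(image, question))
```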

Core claim

MLLMs suffer from an internal "Perceptual Overload" on complex hierarchical tables because the number of involved discrete visual regions scales disproportionately with task complexity, impairing accurate spatial attention during implicit generation. Supplying explicit spatial constraints from a rendering-based deterministic grounding pipeline, which couples multi-step logical deductions with pixel-perfect spatial ground truths, recovers this reasoning potential, as shown by diagnostic probing and by a two-stage decoupled framework that achieves a 12.3% overall accuracy improvement on the TableVision test set.

What carries the argument

The rendering-based deterministic grounding pipeline that couples multi-step logical deductions with pixel-perfect spatial ground truths across 6,799 trajectories in the TableVision benchmark.
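The "deterministic" part is worth unpacking: when the benchmark renders its own tables, every cell's pixel box follows from the layout rather than from a noisy detector. A minimal sketch under assumed layout parameters; `CELL_W`, `CELL_H`, and the `GroundedStep` record are invented here for illustration.

```python
# Illustrative sketch (not the authors' pipeline): with self-rendered tables,
# each logical step can be paired with an exact ground-truth region.

from dataclasses import dataclass

CELL_W, CELL_H = 120, 32  # fixed layout parameters, chosen for illustration

@dataclass
class GroundedStep:
    rationale: str                   # one logical deduction in the chain
    bbox: tuple[int, int, int, int]  # pixel-perfect (x1, y1, x2, y2)

def cell_bbox(row: int, col: int) -> tuple[int, int, int, int]:
    # Deterministic coordinate mapping: no detector, no annotation noise.
    x1, y1 = col * CELL_W, row * CELL_H
    return (x1, y1, x1 + CELL_W, y1 + CELL_H)

def ground_trajectory(steps: list[tuple[str, int, int]]) -> list[GroundedStep]:
    # Each step names the (row, col) it reads; rendering geometry supplies the box.
    return [GroundedStep(text, cell_bbox(r, c)) for text, r, c in steps]

trajectory = ground_trajectory([
    ("Locate the 'Q3 Revenue' column header", 0, 2),
    ("Read the value for the 'EMEA' row", 3, 2),
])
```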

If this is right

  • Explicit spatial constraints significantly recover MLLM reasoning performance on hierarchical tables.
  • The two-stage decoupled framework delivers a robust 12.3% accuracy improvement on the test set.
  • Diagnostic probing can isolate the contribution of spatial attention to overall gains.
  • Tasks stratified into Perception, Reasoning, and Analysis levels allow finer evaluation of model weaknesses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perceptual-overload pattern may appear in other structured document types such as charts or forms.
  • Future architectures could embed spatial grounding internally instead of relying on an external rendering pipeline.
  • The benchmark's trajectory format could support training regimes that jointly optimize perception and logic.

Load-bearing premise

The rendering-based deterministic grounding pipeline produces unbiased, pixel-perfect spatial ground truths, and the measured accuracy gains are caused by the spatial constraints rather than by other differences in prompting or training.

What would settle it

A controlled test in which the same models receive identical spatial information but show no accuracy improvement would falsify the claim that explicit spatial constraints are what recovers reasoning potential.
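A sketch of what that isolating control could look like in code, holding the model, items, and prompt template fixed while toggling only the spatial hint. The `model` callable, dataset fields, and exact-match scoring are hypothetical placeholders.

```python
def build_prompt(question: str, boxes, with_grounding: bool) -> str:
    # Identical template in both arms; only the spatial hint is toggled.
    if with_grounding and boxes:
        hint = "; ".join(str(b) for b in boxes)
        return f"Relevant regions: {hint}\nQuestion: {question}"
    return f"Question: {question}"

def run_condition(model, dataset, with_grounding: bool) -> float:
    correct = 0
    for item in dataset:  # items assumed to expose .image, .question, .boxes, .answer
        prompt = build_prompt(item.question, item.boxes, with_grounding)
        correct += model(item.image, prompt).strip() == item.answer
    return correct / len(dataset)

# The claim predicts a positive gap; no gap under identical spatial
# information would falsify it:
# gap = run_condition(m, test_set, True) - run_condition(m, test_set, False)
```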

Figures

Figures reproduced from arXiv: 2604.03660 by Hanqing Wang, Hui Xiong, Junyong Lin, Lu Dai, Wenbin Dai, Xiaoyu Chen, Yanzong Zheng, Zhenggang Xia, Zhuoyu Li.

Figure 1: The Motivation and Insight of TableVision. (I) Motivation: …
Figure 2: Overview of the TableVision benchmark and the proposed grounding …
Figure 3: The TableVision Annotation Pipeline. The framework integrates deterministic rendering for coordinate mapping, LLM-based rationale generation (CoT), and a human-in-the-loop verification loop (Modify/Drop) to ensure high-fidelity spatial-logical alignment. Step 3: Deterministic Alignment and Decoupling. In the final stage, the pipeline executes a deterministic matching algorithm to transform semantic tags i…
Figure 4: The SFT training pipeline of our framework. We fine-tune LoRA adapters on a frozen Qwen3-VL-8B-Instruct backbone. The process is decoupled into Explanatory Structural Localization and Grounding-Conditioned Reasoning. Input Formulation and Tokenization. The model acts as a structural parser guided by a system prompt. The prompt defines five semantic label types (column, row, cell, colhead, and rowhead) to …
Figure 5: Quantitative analysis of Stage-1 spatial grounding and its correlation with downstream S2 Pipeline accuracy. Stage-1 grounding achieves a median IoU of 0.672, with 61.8% of predicted boxes exceeding the 0.5 IoU threshold. However, performance drops significantly as the precision requirement increases, with only 12.2% of boxes achieving IoU ≥ 0.9. This confirms that high-precision localization in dense tables remains the…
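Two short sketches follow from the captions above. First, the Figure 4 setup, LoRA adapters on a frozen backbone, in rough outline; the loader class, model identifier string, target modules, and hyperparameters are guesses, not values from the paper.

```python
# A minimal sketch of LoRA-adapter fine-tuning on a frozen backbone, in the
# spirit of Figure 4. All hyperparameters here are illustrative assumptions.

from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
for p in base.parameters():
    p.requires_grad = False  # backbone stays frozen; only adapters train

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```

Second, the Figure 5 statistics (median IoU, fraction of boxes above the 0.5 and 0.9 thresholds) reduce to a per-box IoU computation; boxes are assumed to be (x1, y1, x2, y2) pixel tuples.

```python
import statistics

def box_area(b) -> int:
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def iou(a, b) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def grounding_report(pred, gold) -> dict:
    scores = [iou(p, g) for p, g in zip(pred, gold)]
    return {
        "median_iou": statistics.median(scores),
        "frac_iou_ge_0.5": sum(s >= 0.5 for s in scores) / len(scores),
        "frac_iou_ge_0.9": sum(s >= 0.9 for s in scores) / len(scores),
    }
```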
read the original abstract

Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TableVision, a large-scale benchmark for spatially grounded reasoning over complex hierarchical tables. It identifies a Perception Bottleneck in MLLMs where increasing task complexity leads to disproportionate growth in discrete visual regions and internal perceptual overload during implicit generation. The work constructs 6,799 high-fidelity reasoning trajectories across three cognitive levels (Perception, Reasoning, Analysis) and 13 sub-categories using a rendering-based deterministic grounding pipeline that couples multi-step deductions with pixel-perfect spatial ground truths. It further proposes a two-stage decoupled framework whose empirical results, supported by diagnostic probing, show that explicit spatial constraints recover MLLM reasoning potential, yielding a 12.3% overall accuracy improvement on the test set.

Significance. If the central empirical claims hold after addressing controls, this would represent a meaningful contribution to multimodal document understanding by supplying a trajectory-aware benchmark that explicitly links perception and logic, and by quantifying how spatial grounding can mitigate perceptual overload in MLLMs. The scale of the dataset and the diagnostic identification of the bottleneck are clear strengths that could serve as a testbed for future work on hierarchical table reasoning.

major comments (3)
  1. Experimental evaluation: The 12.3% accuracy improvement is attributed to explicit spatial constraints within the two-stage decoupled framework, yet no ablation is described that holds the two-stage architecture, prompting, and training fixed while varying only the presence of spatial grounding; without this isolating control the causal attribution remains unsecured.
  2. Benchmark construction (rendering pipeline): The claim that the rendering-based deterministic grounding pipeline yields unbiased pixel-perfect spatial ground truths is load-bearing for the entire benchmark, but the manuscript provides no validation, error analysis, or comparison against alternative grounding methods to confirm absence of rendering artifacts or bias.
  3. Perception Bottleneck analysis: The quantitative demonstration that the number of involved discrete visual regions increases disproportionately with task complexity, leading to perceptual overload, lacks the specific metrics, scaling plots, or statistical characterization needed to make the bottleneck identification fully reproducible and load-bearing for the subsequent framework design.
minor comments (2)
  1. Abstract: The mention of 'diagnostic probing' supporting the results would be strengthened by a brief indication of the probing techniques or key findings.
  2. Overall presentation: Ensure all reported accuracy figures are accompanied by error bars, number of runs, and statistical significance tests to meet standard empirical reporting expectations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to incorporating the suggested controls, validations, and expansions in the revised manuscript to strengthen the empirical claims and reproducibility.

read point-by-point responses
  1. Referee: Experimental evaluation: The 12.3% accuracy improvement is attributed to explicit spatial constraints within the two-stage decoupled framework, yet no ablation is described that holds the two-stage architecture, prompting, and training fixed while varying only the presence of spatial grounding; without this isolating control the causal attribution remains unsecured.

    Authors: We agree that an isolating ablation is essential to secure causal attribution. In the revised manuscript we will add a dedicated ablation study that fixes the two-stage architecture, prompting templates, and training procedure while varying only the presence or absence of explicit spatial grounding. This will directly quantify the incremental contribution of spatial constraints to the reported 12.3% accuracy gain and will be presented alongside the existing results. revision: yes

  2. Referee: Benchmark construction (rendering pipeline): The claim that the rendering-based deterministic grounding pipeline yields unbiased pixel-perfect spatial ground truths is load-bearing for the entire benchmark, but the manuscript provides no validation, error analysis, or comparison against alternative grounding methods to confirm absence of rendering artifacts or bias.

    Authors: We acknowledge the need for explicit validation of the grounding pipeline. We will add a new subsection that reports error analysis on a randomly sampled subset of 200 tables, comparing the rendering-derived spatial ground truths against independent human annotations. We will also include quantitative metrics (e.g., pixel-level IoU and bounding-box precision) and a brief discussion of potential rendering artifacts, thereby substantiating the claim of pixel-perfect grounding. revision: yes

  3. Referee: Perception Bottleneck analysis: The quantitative demonstration that the number of involved discrete visual regions increases disproportionately with task complexity, leading to perceptual overload, lacks the specific metrics, scaling plots, or statistical characterization needed to make the bottleneck identification fully reproducible and load-bearing for the subsequent framework design.

    Authors: We agree that additional quantitative detail is required for reproducibility. In the revised manuscript we will expand the Perception Bottleneck section with (i) per-complexity-level statistics on the number of discrete visual regions, (ii) scaling plots that visualize region growth against task complexity, and (iii) statistical measures including Pearson correlation and regression slopes. These additions will make the bottleneck analysis fully reproducible and will directly motivate the design of the decoupled framework. revision: yes
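A sketch of the scaling analysis promised here: Pearson correlation and a least-squares slope between task complexity and region count. The data values below are illustrative placeholders, not measurements from the paper.

```python
import statistics

def pearson(xs, ys) -> float:
    # Pearson correlation between two equal-length samples.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def slope(xs, ys) -> float:
    # Least-squares regression slope of ys on xs.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

complexity = [1, 1, 2, 2, 3, 3]    # e.g., number of reasoning hops (placeholder)
regions    = [2, 3, 6, 7, 13, 15]  # discrete visual regions touched (placeholder)

r = pearson(complexity, regions)  # disproportionate growth shows up as r near 1
b = slope(complexity, regions)    # together with a steep regression slope
```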

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or empirical results

full rationale

The paper introduces TableVision as a new benchmark with a rendering-based deterministic grounding pipeline and reports an observed 12.3% accuracy lift from a two-stage decoupled framework on its test set. No equations, fitted parameters, or derivations reduce any claimed result to its inputs by construction. The central claims rest on empirical measurements and diagnostic probing rather than self-referential definitions or self-citation chains that force the outcome. Self-evaluation on a newly constructed dataset introduces no circularity under the specified patterns, as the pipeline is described as independently verifiable and the accuracy gains are presented as measured outcomes rather than renamed fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the existence of a perception bottleneck in MLLMs for hierarchical tables and on the assumption that explicit spatial grounding directly mitigates it; the benchmark itself supplies the evaluation substrate.

axioms (1)
  • domain assumption: MLLMs experience perceptual overload proportional to the number of discrete visual regions in complex tables
    Identified via quantitative analysis described in the abstract.
invented entities (1)
  • Perception Bottleneck: no independent evidence
    purpose: Explains the scaling failure of MLLMs on hierarchical tables
    Introduced as the core diagnostic finding from the paper's analysis.

pith-pipeline@v0.9.0 · 5559 in / 1275 out tokens · 49120 ms · 2026-05-13T17:17:06.768845+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 12 internal anchors

  1. [1] Achiam, O.J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., de Almeida, D.M., Altenschmidt, J., Altman, S., Anadkat, S., et al.: …
  2. [2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv abs/2502.13923 (2025)
  4. [4] Cao, B., Lu, H., Ma, C., Wang, T., Li, R., Fan, J.: Orthogonal hierarchical decomposition for structure-aware table understanding with large language models (2026)
  5. [5] Cao, L., Liu, H.: TableMaster: A recipe to advance table understanding with language models. arXiv preprint arXiv:2501.19378 (2025)
  6. [6] Cao, Y., Chen, S., Liu, R., Wang, Z., Fried, D.: API-assisted code generation for question answering on varied table structures. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14536–14548 (2023)
  7. [8] Chen, W., Chang, M.W., Schlinger, E., Wang, W.Y., Cohen, W.W.: Open question answering over tables and text. arXiv abs/2010.10439 (2020)
  8. [9] Chen, W., Wang, H., Chen, J., Zhang, Y., Wang, H., Li, S., Zhou, X., Wang, W.Y.: TabFact: A large-scale dataset for table-based fact verification. arXiv abs/1909.02164 (2019)
  9. [10] Chen, W., et al.: HybridQA: A dataset of multi-hop question answering over tabular and textual data. In: Proceedings of the ACL (2020)
  10. [11] Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: SpatialRGPT: Grounded spatial reasoning in vision language model. arXiv abs/2406.01584 (2024)
  11. [12] Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., Bai, Y., Tang, J., Wang, H., Huang, M.: Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800 (2025)
  12. [13] Cheng, Z., Dong, H., Wang, Z., Jia, R., Guo, J., Gao, Y., Han, S., Lou, J.G., Zhang, D.: HiTab: A hierarchical table dataset for question answering and natural language generation. In: Annual Meeting of the Association for Computational Linguistics (2021)
  13. [14] Cheng, Z., Dong, H., Wang, Z., Jia, R., Guo, J., Gao, Y., Han, S., Lou, J.G., Zhang, D.: HiTab: A hierarchical table dataset for question answering and natural language generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1094–1…
  14. [15] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B.A., Fung, P., Hoi, S.C.H.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv abs/2305.06500 (2023)
  15. [16] Dao, D.H., Huynh, N.T., Tran, K.Q., Nguyen, K.V.: Open-ViTabQA: A novel benchmark for Vietnamese question answering on open domain Wikipedia table. Knowledge-Based Systems 330, 114391 (2025). https://doi.org/10.1016/j.knosys.2025.114391
  16. [17] DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J.M., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)
  17. [18] Fu, X., Liu, M., Yang, Z., Corring, J., Lu, Y., Yang, J., Roth, D., Florencio, D., Zhang, C.: ReFocus: Visual editing as a chain of thought for structured image understanding. arXiv preprint arXiv:2501.05452 (2025)
  18. [19] Guo, Z., Xu, R., Yao, Y., Cui, J., Ni, Z., Ge, C., Chua, T.S., Liu, Z., Huang, G.: LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. In: European Conference on Computer Vision, pp. 390–406. Springer (2024)
  19. [20] Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14953–14962 (2023)
  20. [21] Gupta, V., Mehta, M., Nokhiz, P., Srikumar, V.: InfoTabs: Inference on tables as semi-structured data. In: Annual Meeting of the Association for Computational Linguistics (2020)
  21. [22] Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: TaPas: Weakly supervised table parsing via pre-training. In: Annual Meeting of the Association for Computational Linguistics (2020)
  22. [23] Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025)
  23. [24] Hu, Y., Hua, H., Yang, Z., Shi, W., Smith, N.A., Luo, J.: PromptCap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699 (2022)
  24. [25] Janner, M., Narasimhan, K., Barzilay, R.: Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics 6, 49–61 (2017)
  25. [26] Kang, W., Kuen, J., Ren, M., Wei, Z., Yan, Y., Liu, K.: VGent: Visual grounding via modular design for disentangling reasoning and prediction. arXiv preprint arXiv:2512.11099 (2025)
  26. [27] Kang, X., Wang, Z., Jin, X., Wang, W., Huang, K., Wang, Q.: Template-driven LLM-paraphrased framework for tabular math word problem generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 24303–24311 (2025)
  27. [28] Kang, X., Wu, S., Wang, Z., Liu, Y., Jin, X., Huang, K., Wang, W., Yue, Y., Huang, X., Wang, Q.: Can GRPO boost complex multimodal table understanding? In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12642–12655 (2025)
  28. [29] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-free document understanding transformer. In: European Conference on Computer Vision (2021)
  29. [30] Kim, Y., Yim, M., Song, K.Y.: TableVQA-Bench: A visual question answering benchmark on multiple table domains. arXiv abs/2404.19205 (2024)
  30. [31] Liu, F., Eisenschlos, J., Piccinno, F., Krichene, S., Pang, C., Lee, K., Joshi, M., Chen, W., Collier, N., Altun, Y.: DePlot: One-shot visual language reasoning by plot-to-table translation. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 10381–10399 (2023)
  31. [32] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
  32. [33] Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: LLaVA-Plus: Learning to use tools for creating multimodal agents. In: European Conference on Computer Vision, pp. 126–142. Springer (2024)
  33. [34] Liu, S., Zhang, Z., Hu, P., Ma, J., Du, J., Wang, Q., Zhang, J., Liu, C.: See then tell: Enhancing key information extraction with vision grounding. Neurocomputing, 132858 (2026)
  34. [35] Lompo, B.A., Haraoui, M.: Visual-TableQA: Open-domain benchmark for reasoning over table images. CoRR abs/2509.07966 (2025). https://doi.org/10.48550/arXiv.2509.07966
  35. [36] Mallis, D., Karadeniz, A.S., Cavada, S., Rukhovich, D., Foteinopoulou, N., Cherenkova, K., Kacem, A., Aouada, D.: CAD-Assistant: Tool-augmented VLLMs as generic CAD task solvers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7284–7294 (2025)
  36. [37] Nassar, A.S., Livathinos, N., Lysak, M., Staar, P.W.J.: TableFormer: Table structure understanding with transformers. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4604–4613 (2022)
  37. [38] Parikh, A., Wang, X., Gehrmann, S., Faruqui, M., Dhingra, B., Yang, D., Das, D.: ToTTo: A controlled table-to-text generation dataset. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1173–1186 (2020)
  38. [39] Pasupat, P., Liang, P.: Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470–1480 (2015)
  39. [40] Shi, H., Xie, Y., Goncalves, L., Gao, S., Zhao, J.: WikiDT: Visual-based table recognition and question answering dataset. In: Document Analysis and Recognition, ICDAR 2024, Athens, Greece, Proceedings, Part I. Lecture Notes in Computer Science, vol. 14804, pp. 406–437. Springer (2024)
  40. [41] Singh, A., Chaudhary, R., Singh, G., Kumary, A.: Lost in translation and noise: A deep dive into the failure modes of VLMs on real-world tables. arXiv preprint arXiv:2511.17238 (2025)
  41. [42] Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025)
  42. [43] Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14398–14409 (2024)
  43. [44] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)
  44. [45] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
  45. [46] Wu, J., Yang, L., Li, D., Ji, Y., Okumura, M., Zhang, Y.: MMQA: Evaluating LLMs with multi-table multi-hop complex questions. In: International Conference on Learning Representations (ICLR) 2025, oral. https://openreview.net/forum?id=GGlpykXDCa
  46. [47] Wu, P., Yang, Y., Zhu, G., Ye, C., Gu, H., Lu, X., Xiao, R., Bao, B., He, Y., Zha, L., et al.: RealHiTBench: A comprehensive realistic hierarchical table benchmark for evaluating LLM-based table analysis. In: Findings of the Association for Computational Linguistics: ACL 2025, pp. 7105–7137 (2025)
  47. [48] Wu, X., Yang, J., Chai, L., Zhang, G., Liu, J., Du, X., Liang, D., Shu, D., Cheng, X., Sun, T., Niu, G., Li, T., Li, Z.: TableBench: A comprehensive and complex benchmark for table question answering. arXiv abs/2408.09174 (2024)
  48. [49] Xing, J., He, Y., Zhou, M., Dong, H., Han, S., Chen, L., Zhang, D., Chaudhuri, S., Jagadish, H.V.: MMTU: A massive multi-task table understanding and reasoning benchmark. arXiv abs/2506.05587 (2025)
  49. [50] Xu, P., Wang, S., Zhu, Y., Li, J., Zhang, Y.: SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv abs/2511.21471 (2025)
  50. [51] Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let's think only with images. arXiv preprint arXiv:2505.11409 (2025)
  51. [52] Yang, Y., Patel, A., Deitke, M., Gupta, T., Weihs, L., Head, A., Yatskar, M., Callison-Burch, C., Krishna, R., Kembhavi, A., et al.: Scaling text-rich image understanding via code-guided synthetic multimodal data generation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 17486–1…
  52. [53] Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800 (2024)
  53. [54] Yutong, G., Wang, W., Wu, Y., Miao, Z., Wang, H.: TALENT: Table VQA via augmented language-enhanced natural-text transcription. arXiv preprint arXiv:2510.07098 (2025)
  54. [55] Zhang, J., Pan, C., Wei, K., Xiong, S., Zhao, Y., Li, X., Peng, J., Gu, X., Yang, J., Chang, W., Wu, Z., Zhong, J., Song, S., Li, Y., Li, X.: T2R-Bench: A benchmark for generating article-level reports from real world industrial tables. arXiv abs/2508.19813 (2025)
  55. [56] Zhao, W., Feng, H., Liu, Q., Tang, J., Wei, S., Wu, B., Liao, L., Ye, Y., Liu, H., Li, H., Huang, C.: TabPedia: Towards comprehensive visual table understanding with concept synergy. CoRR abs/2406.01326 (2024). https://doi.org/10.48550/arXiv.2406.01326
  56. [57] Zhao, W., Liu, Y., Wan, Y., Wang, Y., Deng, Z., Yu, P.S.: Localize, retrieve and fuse: A generalized framework for free-form question answering over tables. In: Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), pp. 1–12 (2023)
  57. [58] Zheng, M., Feng, X., Si, Q., She, Q., Lin, Z., Jiang, W., Wang, W.: Multimodal table understanding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9102–9124 (2024)
  58. [59] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)
  59. [60] Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: Data, model, and evaluation. In: European Conference on Computer Vision, pp. 564–580. Springer (2020)
  60. [61] Zhong, X., Shafieibavani, E., Jimeno-Yepes, A.: Image-based table recognition: Data, model, and evaluation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)
  61. [62] Zhu, F., et al.: TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In: Proceedings of the ACL (2021)
  62. [63] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)