Recognition: 2 Lean theorem links
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
Pith reviewed 2026-05-13 17:17 UTC · model grok-4.3
The pith
Explicit spatial constraints recover the reasoning potential of multimodal models on complex hierarchical tables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLLMs suffer from an internal Perceptual Overload on complex hierarchical tables because the number of involved discrete visual regions scales disproportionately with task complexity, impairing accurate spatial attention during implicit generation. A rendering-based deterministic grounding pipeline that couples multi-step logical deductions with pixel-perfect spatial ground truths recovers this reasoning potential, as shown by diagnostic probing and by a two-stage decoupled framework that achieves a 12.3% overall accuracy improvement on the TableVision test set.
What carries the argument
The rendering-based deterministic grounding pipeline that couples multi-step logical deductions with pixel-perfect spatial ground truths across 6,799 trajectories in the TableVision benchmark.
If this is right
- Explicit spatial constraints significantly recover MLLM reasoning performance on hierarchical tables.
- The two-stage decoupled framework delivers a robust 12.3% accuracy improvement on the test set.
- Diagnostic probing can isolate the contribution of spatial attention to overall gains.
- Tasks stratified into Perception, Reasoning, and Analysis levels allow finer evaluation of model weaknesses.
Where Pith is reading between the lines
- The same perceptual-overload pattern may appear in other structured document types such as charts or forms.
- Future architectures could embed spatial grounding internally instead of relying on an external rendering pipeline.
- The benchmark's trajectory format could support training regimes that jointly optimize perception and logic.
Load-bearing premise
The rendering-based deterministic grounding pipeline produces unbiased, pixel-perfect spatial ground truths, and the measured accuracy gains are caused by the spatial constraints rather than by other differences in prompting or training.
What would settle it
A controlled test in which the same models receive identical spatial information but show no accuracy improvement would falsify the claim that explicit spatial constraints are what recovers reasoning potential.
read the original abstract
Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TableVision, a large-scale benchmark for spatially grounded reasoning over complex hierarchical tables. It identifies a Perception Bottleneck in MLLMs where increasing task complexity leads to disproportionate growth in discrete visual regions and internal perceptual overload during implicit generation. The work constructs 6,799 high-fidelity reasoning trajectories across three cognitive levels (Perception, Reasoning, Analysis) and 13 sub-categories using a rendering-based deterministic grounding pipeline that couples multi-step deductions with pixel-perfect spatial ground truths. It further proposes a two-stage decoupled framework whose empirical results, supported by diagnostic probing, show that explicit spatial constraints recover MLLM reasoning potential, yielding a 12.3% overall accuracy improvement on the test set.
Significance. If the central empirical claims hold after addressing controls, this would represent a meaningful contribution to multimodal document understanding by supplying a trajectory-aware benchmark that explicitly links perception and logic, and by quantifying how spatial grounding can mitigate perceptual overload in MLLMs. The scale of the dataset and the diagnostic identification of the bottleneck are clear strengths that could serve as a testbed for future work on hierarchical table reasoning.
major comments (3)
- Experimental evaluation: The 12.3% accuracy improvement is attributed to explicit spatial constraints within the two-stage decoupled framework, yet no ablation is described that holds the two-stage architecture, prompting, and training fixed while varying only the presence of spatial grounding; without this isolating control the causal attribution remains unsecured.
- Benchmark construction (rendering pipeline): The claim that the rendering-based deterministic grounding pipeline yields unbiased pixel-perfect spatial ground truths is load-bearing for the entire benchmark, but the manuscript provides no validation, error analysis, or comparison against alternative grounding methods to confirm absence of rendering artifacts or bias.
- Perception Bottleneck analysis: The quantitative demonstration that the number of involved discrete visual regions increases disproportionately with task complexity, leading to perceptual overload, lacks the specific metrics, scaling plots, or statistical characterization needed to make the bottleneck identification fully reproducible and load-bearing for the subsequent framework design.
minor comments (2)
- Abstract: The mention of 'diagnostic probing' supporting the results would be strengthened by a brief indication of the probing techniques or key findings.
- Overall presentation: Ensure all reported accuracy figures are accompanied by error bars, number of runs, and statistical significance tests to meet standard empirical reporting expectations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to incorporating the suggested controls, validations, and expansions in the revised manuscript to strengthen the empirical claims and reproducibility.
read point-by-point responses
-
Referee: Experimental evaluation: The 12.3% accuracy improvement is attributed to explicit spatial constraints within the two-stage decoupled framework, yet no ablation is described that holds the two-stage architecture, prompting, and training fixed while varying only the presence of spatial grounding; without this isolating control the causal attribution remains unsecured.
Authors: We agree that an isolating ablation is essential to secure causal attribution. In the revised manuscript we will add a dedicated ablation study that fixes the two-stage architecture, prompting templates, and training procedure while varying only the presence or absence of explicit spatial grounding. This will directly quantify the incremental contribution of spatial constraints to the reported 12.3% accuracy gain and will be presented alongside the existing results. revision: yes
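The promised ablation can be expressed as a small harness: hold model, data, prompting, and training fixed and toggle only spatial grounding. This is a minimal sketch; the `evaluate(model, dataset, use_spatial_grounding)` callable and its signature are hypothetical stand-ins, not the paper's actual code.

```python
# Hypothetical sketch of the isolating ablation: everything is held fixed
# except the presence of explicit spatial grounding.

def run_ablation(evaluate, dataset, model):
    """evaluate(model, dataset, use_spatial_grounding) -> accuracy in [0, 1]."""
    acc_with = evaluate(model, dataset, use_spatial_grounding=True)
    acc_without = evaluate(model, dataset, use_spatial_grounding=False)
    # The causal claim predicts a positive delta attributable to grounding alone.
    return {"with": acc_with, "without": acc_without,
            "delta": acc_with - acc_without}
```

Under the paper's claim, `delta` should account for most of the reported 12.3% gain once the other factors are controlled.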
-
Referee: Benchmark construction (rendering pipeline): The claim that the rendering-based deterministic grounding pipeline yields unbiased pixel-perfect spatial ground truths is load-bearing for the entire benchmark, but the manuscript provides no validation, error analysis, or comparison against alternative grounding methods to confirm absence of rendering artifacts or bias.
Authors: We acknowledge the need for explicit validation of the grounding pipeline. We will add a new subsection that reports error analysis on a randomly sampled subset of 200 tables, comparing the rendering-derived spatial ground truths against independent human annotations. We will also include quantitative metrics (e.g., pixel-level IoU and bounding-box precision) and a brief discussion of potential rendering artifacts, thereby substantiating the claim of pixel-perfect grounding. revision: yes
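The bounding-box agreement metric mentioned above reduces to a standard intersection-over-union computation; the sketch below is a generic IoU for axis-aligned pixel boxes, not code from the paper or its pipeline.

```python
def bbox_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Averaging this score over rendering-derived boxes versus independent human annotations would quantify the "pixel-perfect" claim directly.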
-
Referee: Perception Bottleneck analysis: The quantitative demonstration that the number of involved discrete visual regions increases disproportionately with task complexity, leading to perceptual overload, lacks the specific metrics, scaling plots, or statistical characterization needed to make the bottleneck identification fully reproducible and load-bearing for the subsequent framework design.
Authors: We agree that additional quantitative detail is required for reproducibility. In the revised manuscript we will expand the Perception Bottleneck section with (i) per-complexity-level statistics on the number of discrete visual regions, (ii) scaling plots that visualize region growth against task complexity, and (iii) statistical measures including Pearson correlation and regression slopes. These additions will make the bottleneck analysis fully reproducible and will directly motivate the design of the decoupled framework. revision: yes
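The statistical characterization in (iii) reduces to standard summary statistics. A minimal stdlib-only sketch, with the input lists as hypothetical placeholders for the per-task measurements the authors would report:

```python
import statistics

def scaling_stats(complexity, regions):
    """Pearson correlation and least-squares slope of region count vs. task complexity."""
    mx, my = statistics.fmean(complexity), statistics.fmean(regions)
    sxx = sum((x - mx) ** 2 for x in complexity)
    sxy = sum((x - mx) * (y - my) for x, y in zip(complexity, regions))
    syy = sum((y - my) ** 2 for y in regions)
    slope = sxy / sxx          # regression slope of regions on complexity
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    return r, slope
```

A slope significantly above linear expectation across complexity levels would substantiate the "disproportionate growth" framing of the Perception Bottleneck.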
Circularity Check
No significant circularity in benchmark construction or empirical results
full rationale
The paper introduces TableVision as a new benchmark with a rendering-based deterministic grounding pipeline and reports an observed 12.3% accuracy lift from a two-stage decoupled framework on its test set. No equations, fitted parameters, or derivations reduce any claimed result to its inputs by construction. The central claims rest on empirical measurements and diagnostic probing rather than self-referential definitions or self-citation chains that force the outcome. Self-evaluation on a newly constructed dataset introduces no circularity under the specified patterns, as the pipeline is described as independently verifiable and the accuracy gains are presented as measured outcomes rather than renamed fits.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MLLMs experience perceptual overload proportional to the number of discrete visual regions in complex tables
invented entities (1)
- Perception Bottleneck (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: relation between the paper passage and the cited Recognition theorem.
Our empirical results... demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear?
unclear: relation between the paper passage and the cited Recognition theorem.
rendering-based deterministic grounding pipeline... pixel-perfect spatial ground truths
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Achiam, O.J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., de Almeida, D.M., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.L.,...
work page 2023
-
[2]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 4, 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. ArXivabs/2502.13923(2025),https: //api.semanticscholar.org/CorpusID:2764...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Cao, B., Lu, H., Ma, C., Wang, T., Li, R., Fan, J.: Orthogonal hierarchical decomposition for structure-aware table understanding with large language models (2026), https://api.semanticscholar.org/CorpusID:2852694222
work page 2026
-
[5]
TableMaster: A Recipe to Advance Table Understanding with Language Models
Cao, L., Liu, H.: Tablemaster: A recipe to advance table understanding with language models. arXiv preprint arXiv:2501.19378 (2025) 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Cao, Y., Chen, S., Liu, R., Wang, Z., Fried, D.: Api-assisted code generation for question answering on varied table structures. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 14536–14548 (2023) 4
work page 2023
-
[8]
Chen, W., Chang, M.W., Schlinger, E., Wang, W.Y., Cohen, W.W.: Open question answering over tables and text. ArXiv abs/2010.10439 (2020), https://api.semanticscholar.org/CorpusID:2248036012
-
[9]
Chen, W., Wang, H., Chen, J., Zhang, Y., Wang, H., LI, S., Zhou, X., Wang, W.Y.: Tabfact: A large-scale dataset for table-based fact verification. ArXiv abs/1909.02164 (2019), https://api.semanticscholar.org/CorpusID:1989173392
-
[10]
Chen, W., et al.: Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In: Proceedings of the ACL (2020) 5, 6
work page 2020
-
[11]
Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision language model. ArXiv abs/2406.01584 (2024), https://api.semanticscholar.org/CorpusID:2702159842
-
[12]
Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., Bai, Y., Tang, J., Wang, H., Huang, M.: Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800 (2025) 11
-
[13]
Cheng, Z., Dong, H., Wang, Z., Jia, R., Guo, J., Gao, Y., Han, S., Lou, J.G., Zhang, D.: Hitab: A hierarchical table dataset for question answering and natural language generation. In: Annual Meeting of the Association for Computational Linguistics (2021),https://api.semanticscholar.org/CorpusID:2370913772
work page 2021
-
[14]
Cheng, Z., Dong, H., Wang, Z., Jia, R., Guo, J., Gao, Y., Han, S., Lou, J.G., Zhang, D.: Hitab: A hierarchical table dataset for question answering and natural language generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1094–1...
work page 2022
-
[15]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B.A., Fung, P., Hoi, S.C.H.: Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv abs/2305.06500 (2023), https://api.semanticscholar.org/CorpusID:2586152662
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Dao, D.H., Huynh, N.T., Tran, K.Q., Nguyen, K.V.: Open-vitabqa: A novel benchmark for vietnamese question answering on open domain wikipedia table. Knowl. Based Syst. 330, 114391 (2025). https://doi.org/10.1016/j.knosys.2025.114391 5, 6
-
[17]
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J.M., et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645, 633–638 (2025) 2
work page 2025
-
[18]
Fu, X., Liu, M., Yang, Z., Corring, J., Lu, Y., Yang, J., Roth, D., Florencio, D., Zhang, C.: Refocus: Visual editing as a chain of thought for structured image understanding. arXiv preprint arXiv:2501.05452 (2025) 5, 6
-
[19]
Guo, Z., Xu, R., Yao, Y., Cui, J., Ni, Z., Ge, C., Chua, T.S., Liu, Z., Huang, G.: Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. In: European Conference on Computer Vision. pp. 390–406. Springer (2024) 4
work page 2024
-
[20]
Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14953–14962 (2023) 5
work page 2023
-
[21]
Gupta, V., Mehta, M., Nokhiz, P., Srikumar, V.: Infotabs: Inference on tables as semi-structured data. In: Annual Meeting of the Association for Computational Linguistics (2020),https://api.semanticscholar.org/CorpusID:2186140952
work page 2020
-
[22]
Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: Tapas: Weakly supervised table parsing via pre-training. In: Annual Meeting of the Association for Computational Linguistics (2020), https://api.semanticscholar.org/CorpusID:2148029012
work page 2020
-
[23]
Deepeyesv2: Toward agentic multimodal model
Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025) 5
-
[24]
Promptcap: Prompt-guided task-aware image captioning
Hu, Y., Hua, H., Yang, Z., Shi, W., Smith, N.A., Luo, J.: Promptcap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699 (2022) 5
-
[25]
Janner, M., Narasimhan, K., Barzilay, R.: Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics 6, 49–61 (2017) 2
work page 2017
-
[26]
Kang, W., Kuen, J., Ren, M., Wei, Z., Yan, Y., Liu, K.: Vgent: Visual grounding via modular design for disentangling reasoning and prediction. arXiv preprint arXiv:2512.11099 (2025) 2, 5
-
[27]
Kang, X., Wang, Z., Jin, X., Wang, W., Huang, K., Wang, Q.: Template-driven llm-paraphrased framework for tabular math word problem generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 24303–24311 (2025) 5, 6
work page 2025
-
[28]
Kang, X., Wu, S., Wang, Z., Liu, Y., Jin, X., Huang, K., Wang, W., Yue, Y., Huang, X., Wang, Q.: Can grpo boost complex multimodal table understanding? In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 12642–12655 (2025) 4
work page 2025
-
[29]
Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Ocr-free document understanding transformer. In: European Conference on Computer Vision (2021), https://api.semanticscholar.org/CorpusID:2509248702
work page 2021
-
[30]
Tablevqa-bench: A visual question answering benchmark on multiple table domains
Kim, Y., Yim, M., Song, K.Y.: Tablevqa-bench: A visual question answering benchmark on multiple table domains. ArXiv abs/2404.19205 (2024), https://api.semanticscholar.org/CorpusID:269457160 5, 6
-
[31]
Liu, F., Eisenschlos, J., Piccinno, F., Krichene, S., Pang, C., Lee, K., Joshi, M., Chen, W., Collier, N., Altun, Y.: Deplot: One-shot visual language reasoning by plot-to-table translation. In: Findings of the Association for Computational Linguistics: ACL 2023. pp. 10381–10399 (2023) 4
work page 2023
-
[32]
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023) 4
work page 2023
-
[33]
Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: Llava-plus: Learning to use tools for creating multimodal agents. In: European conference on computer vision. pp. 126–142. Springer (2024) 4
work page 2024
-
[34]
Liu, S., Zhang, Z., Hu, P., Ma, J., Du, J., Wang, Q., Zhang, J., Liu, C.: See then tell: Enhancing key information extraction with vision grounding. Neurocomputing p. 132858 (2026) 4
work page 2026
-
[35]
Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Lompo, B.A., Haraoui, M.: Visual-tableqa: Open-domain benchmark for reasoning over table images. CoRR abs/2509.07966 (2025). https://doi.org/10.48550/arXiv.2509.07966 5, 6
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.079665 2025
-
[36]
Mallis, D., Karadeniz, A.S., Cavada, S., Rukhovich, D., Foteinopoulou, N., Cherenkova, K., Kacem, A., Aouada, D.: Cad-assistant: tool-augmented vllms as generic cad task solvers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7284–7294 (2025) 5
work page 2025
-
[37]
Nassar, A.S., Livathinos, N., Lysak, M., Staar, P.W.J.: Tableformer: Table structure understanding with transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 4604–4613 (2022), https://api.semanticscholar.org/CorpusID:247218660 5, 6
work page 2022
-
[38]
Parikh, A., Wang, X., Gehrmann, S., Faruqui, M., Dhingra, B., Yang, D., Das, D.: Totto: A controlled table-to-text generation dataset. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1173–1186 (2020) 5, 6
work page 2020
-
[39]
Pasupat, P., Liang, P.: Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1470–1480 (2015) 5, 6
work page 2015
-
[40]
Shi, H., Xie, Y., Goncalves, L., Gao, S., Zhao, J.: Wikidt: Visual-based table recognition and question answering dataset. In: Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30 - September 4, 2024, Proceedings, Part I. Lecture Notes in Computer Science, vol. 14804, pp. 406–437. Springer (2024). https://doi.org...
-
[41]
Singh, A., Chaudhary, R., Singh, G., Kumary, A.: Lost in translation and noise: A deep dive into the failure modes of vlms on real-world tables. arXiv preprint arXiv:2511.17238 (2025) 2
-
[42]
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025) 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 5
work page 2024
-
[44]
Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024) 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Wu, J., Yang, L., Li, D., Ji, Y., Okumura, M., Zhang, Y.: Mmqa: Evaluating llms with multi-table multi-hop complex questions. In: International Conference on Learning Representations (ICLR) 2025 (2025),https://openreview.net/forum? id=GGlpykXDCa, oral 5, 6
work page 2025
-
[47]
Wu, P., Yang, Y., Zhu, G., Ye, C., Gu, H., Lu, X., Xiao, R., Bao, B., He, Y., Zha, L., et al.: Realhitbench: A comprehensive realistic hierarchical table benchmark for evaluating llm-based table analysis. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 7105–7137 (2025) 5, 6
work page 2025
-
[48]
Wu, X., Yang, J., Chai, L., Zhang, G., Liu, J., Du, X., Liang, D., Shu, D., Cheng, X., Sun, T., Niu, G., Li, T., Li, Z.: Tablebench: A comprehensive and complex benchmark for table question answering. ArXiv abs/2408.09174 (2024), https://api.semanticscholar.org/CorpusID:271902839 5, 6
-
[49]
Xing, J., He, Y., Zhou, M., Dong, H., Han, S., Chen, L., Zhang, D., Chaudhuri, S., Jagadish, H.V.: Mmtu: A massive multi-task table understanding and reasoning benchmark. ArXiv abs/2506.05587 (2025), https://api.semanticscholar.org/CorpusID:279243905 5, 6
-
[50]
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Xu, P., Wang, S., Zhu, Y., Li, J., Zhang, Y.: Spatialbench: Benchmarking multimodal large language models for spatial cognition. ArXiv abs/2511.21471 (2025), https://api.semanticscholar.org/CorpusID:2832621532
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let's think only with images. arXiv preprint arXiv:2505.11409 (2025) 5
-
[52]
Yang, Y., Patel, A., Deitke, M., Gupta, T., Weihs, L., Head, A., Yatskar, M., Callison-Burch, C., Krishna, R., Kembhavi, A., et al.: Scaling text-rich image understanding via code-guided synthetic multimodal data generation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 17486–1...
work page 2025
-
[53]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024) 11
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Yutong, G., Wang, W., Wu, Y., Miao, Z., Wang, H.: Talent: Table vqa via augmented language-enhanced natural-text transcription. arXiv preprint arXiv:2510.07098 (2025) 4
-
[55]
Zhang, J., Pan, C., Wei, K., Xiong, S., Zhao, Y., Li, X., Peng, J., Gu, X., Yang, J., Chang, W., Wu, Z., Zhong, J., Song, S., Li, Y., Li, X.: T2r-bench: A benchmark for generating article-level reports from real world industrial tables. ArXiv abs/2508.19813 (2025), https://api.semanticscholar.org/CorpusID:2809185872
-
[56]
Zhao, W., Feng, H., Liu, Q., Tang, J., Wei, S., Wu, B., Liao, L., Ye, Y., Liu, H., Li, H., Huang, C.: Tabpedia: Towards comprehensive visual table understanding with concept synergy. CoRR abs/2406.01326 (2024). https://doi.org/10.48550/arXiv.2406.01326 4, 5, 6
-
[57]
Zhao, W., Liu, Y., Wan, Y., Wang, Y., Deng, Z., Yu, P.S.: Localize, retrieve and fuse: A generalized framework for free-form question answering over tables. In: Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings). pp. 1–12 (2023) 4
work page 2023
-
[58]
Zheng, M., Feng, X., Si, Q., She, Q., Lin, Z., Jiang, W., Wang, W.: Multimodal table understanding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9102–9124 (2024) 4
work page 2024
-
[59]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025) 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
In: European conference on computer vision
Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. In: European conference on computer vision. pp. 564–580. Springer (2020) 4
work page 2020
-
[61]
Zhong, X., Shafieibavani, E., Jimeno-Yepes, A.: Image-based table recognition: data, model, and evaluation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020), https://api.semanticscholar.org/CorpusID:208267858 5, 6
work page 2020
-
[62]
Zhu, F., et al.: Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. In: Proceedings of the ACL (2021) 5, 6
work page 2021
-
[63]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)