arxiv: 2510.09671 · v2 · submitted 2025-10-08 · 💻 cs.CL

Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

Pith reviewed 2026-05-18 09:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords table question answering · large language models · survey · benchmarks · task setups · modeling strategies · evaluation · reinforcement learning

The pith

This survey organizes table question answering research by categorizing benchmarks, task setups, and LLM modeling strategies while identifying open problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a structured overview of table question answering using large language models by categorizing benchmarks according to table representation, question complexity, modalities, and domains. It groups current modeling strategies by the specific challenges they target and analyzes their strengths and limitations. A sympathetic reader cares because TQA spans diverse settings that have proliferated without clear connections, making it difficult to track progress or spot gaps. The survey unifies these threads to provide a foundation for understanding the state of the art and guiding future work, including in areas like reinforcement learning.

Core claim

The field of table question answering lacks systematic organization of task formulations, core challenges, and methodological trends. This survey addresses the gap by providing a comprehensive categorization of existing benchmarks and task setups, grouping current modeling strategies according to the challenges they target, and highlighting underexplored but timely topics that have not been systematically covered in prior research, thereby offering a consolidated foundation for the TQA community.

What carries the argument

The categorization framework that organizes benchmarks by representation, complexity, modality and domain, and groups modeling strategies by targeted challenges.

If this is right

  • Researchers gain a clearer map of existing benchmarks and can more readily identify which task setups remain underexplored.
  • Modeling approaches can be compared directly by the challenges they address rather than by isolated papers.
  • New work on reinforcement learning or other emerging directions can be situated within the highlighted gaps.
  • The survey's analysis of strengths and limitations can steer choices among LLM-based methods for specific TQA settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same challenge-based grouping might transfer to other structured-data QA tasks such as knowledge-base question answering.
  • Future benchmarks could be designed explicitly to test the boundaries between the categories the survey defines.
  • Empirical studies could measure whether models trained under one challenge category generalize to another.

Load-bearing premise

That prior TQA research is fragmented enough for a single survey to unify the threads and deliver a useful organization that researchers can build on.

What would settle it

A follow-up analysis that identifies many recent papers or trends that fall outside the proposed benchmark categories and challenge-based groupings, showing the organization does not capture the field's structure.

Figures

Figures reproduced from arXiv: 2510.09671 by Annemarie Friedrich, Bolei Ma, Mohsen Mesgar, Wei Zhou.

Figure 1: Different table question answering task setups.
Figure 2: A taxonomy of TQA task setups. We list representative papers for each setup.
Figure 3: A taxonomy of methods categorized by challenges. We list representative papers for each challenge.
Figure 4: Statistics of the collected paper. We show
Figure 5: Performance of (M)LLMs in textual and image-based table understanding. FT-Model denotes fine-tuned models, specifically TableLlaVA-7B (Zheng et al., 2024) and TableLlaMA-7B (Zhang et al., 2024b). OCR refers to configurations in which image tables are first converted to text via optical character recognition (OCR) and then processed using TableLlaMA-7B. in

Original abstract

Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey on Table Question Answering (TQA) in the era of large language models. It claims to address the lack of systematic organization in the field by delivering a comprehensive categorization of existing benchmarks and task setups, grouping current LLM-based modeling strategies according to the challenges they target, analyzing their strengths and limitations, and highlighting underexplored topics such as reinforcement learning applications.

Significance. If the categorization proves complete and the analysis balanced, the survey would provide a valuable consolidated foundation for the TQA community. By unifying disparate research threads across modalities, domains, and LLM adaptations, it can enable a deeper understanding of the state of the art and guide future work in this rapidly evolving area. The structured grouping of methods by targeted challenges is a useful contribution for researchers.

minor comments (3)
  1. [Abstract] The reference to 'emerging research directions such as reinforcement learning' would be more concrete if accompanied by at least one specific citation or example of recent work in that direction.
  2. [§1] Consider adding a short table early in the introduction that contrasts traditional TQA settings with LLM-adapted variants (e.g., differences in table representation and answer complexity) to improve accessibility for readers unfamiliar with the subfield.
  3. [Introduction] The survey would benefit from an explicit statement of the time period and search methodology used to collect the reviewed papers, which would strengthen the claim of comprehensive coverage.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our survey and for recommending minor revision. The referee accurately captures the manuscript's focus on systematically organizing TQA benchmarks, LLM-based modeling strategies grouped by targeted challenges, and underexplored directions such as reinforcement learning. We are pleased that the structured overview is viewed as a useful foundation for the community.

Circularity Check

0 steps flagged

No significant circularity in survey synthesis

full rationale

This is a survey paper whose central contribution is a descriptive categorization of existing TQA benchmarks, task setups, and LLM-based modeling strategies drawn from the published literature. No equations, fitted parameters, predictions, or derivations are present that could reduce to the paper's own inputs by construction. The framing of addressing a 'lack of systematic organization' is a standard meta-claim for survey articles and does not rely on self-citation chains or self-definitional loops for its validity; the work remains self-contained as an external synthesis of prior threads.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey the paper rests on standard NLP community definitions of TQA and prior published work for its content; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond domain assumptions about task diversity.

axioms (1)
  • domain assumption: Table Question Answering spans diverse settings varying in table representation, question/answer complexity, modality, and domain.
    Stated in the abstract as the basis for the task scope and categorization.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

    cs.IR 2026-05 unverdicted novelty 6.0

    PIPER retrieves and ranks tabular datasets by profiling their content and using LLM-generated queries for dense vector search, outperforming metadata baselines and TableQA methods in low-metadata settings.

  2. OmniTQA: A Cost-Aware System for Hybrid Query Processing over Semi-Structured Data

    cs.DB 2026-04 unverdicted novelty 6.0

    OmniTQA integrates LLM semantic reasoning as a first-class query operator with classical relational operators in a cost-aware planner for hybrid structured and semi-structured data.
