arxiv: 2501.19378 · v6 · submitted 2025-01-31 · 💻 cs.CL

TableMaster: A Recipe to Advance Table Understanding with Language Models

Pith reviewed 2026-05-23 04:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords: table understanding, language models, adaptive reasoning, table question answering, WikiTQ, semantic verbalization, symbolic reasoning

The pith

TableMaster improves LM table understanding by extracting relevant data, adding semantic context through verbalization, and switching dynamically between textual and symbolic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies four specific barriers that keep language models from handling tables well: trouble finding the right data inside a table, missing semantic meaning around the numbers and cells, errors when doing math in plain text, and rigid rules when using symbolic methods. It proposes TableMaster as a practical sequence that first pulls out only the needed table parts, rewrites them in natural language with extra context, and then lets the model choose on the fly whether to reason in words or in symbols for each question. If this sequence works, models could answer table-based questions more reliably without requiring bigger models or table-specific training. Experiments on the WikiTQ dataset show the approach reaching 78.13 percent accuracy with GPT-4o-mini and beating prior methods.

Core claim

TableMaster works by first extracting relevant table content and verbalizing it with enriched semantic context, then applying adaptive reasoning that dynamically selects between textual and symbolic reasoning paths depending on the query, which together address the four identified challenges and produce higher accuracy on table understanding tasks.
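The extraction step can be pictured with a small sketch. This is not the authors' implementation: the function name `build_table_of_focus`, the use of an in-memory SQLite table, and the toy data are all assumptions made for illustration. In the paper the column choice and the row filter come from language-model prompts (the figure captions mention column lookup and SQL generation for row lookup); here they are passed in by hand.

```python
import sqlite3
from typing import List, Sequence, Tuple

Row = Tuple[object, ...]


def build_table_of_focus(
    header: Sequence[str],
    rows: Sequence[Row],
    keep_columns: Sequence[str],
    where_sql: str = "",
) -> Tuple[List[str], List[Row]]:
    """Return only the selected columns and the rows matching a WHERE clause.

    Hypothetical helper: `keep_columns` and `where_sql` stand in for what a
    language model would produce. `where_sql` is interpolated directly, so
    this sketch is only suitable for trusted input, and column names must
    not contain double quotes.
    """
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        conn.execute(f"CREATE TABLE t ({columns})")
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)

        selected = ", ".join(f'"{name}"' for name in keep_columns)
        query = f"SELECT {selected} FROM t"
        if where_sql:
            query += f" WHERE {where_sql}"
        return list(keep_columns), conn.execute(query).fetchall()
    finally:
        conn.close()


# Toy data, invented for this example (not taken from the paper).
header = ["Year", "Team", "Points"]
rows = [(1999, "A", 10), (2001, "B", 7), (2003, "C", 12)]
focus_header, focus_rows = build_table_of_focus(
    header, rows, keep_columns=["Team", "Points"], where_sql='"Year" >= 2000'
)
```

The smaller table returned here is what the later verbalization and reasoning steps would receive, which is the point of the extraction stage: the model sees fewer irrelevant cells.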

What carries the argument

Adaptive reasoning, a mechanism that lets the model switch between textual reasoning and symbolic reasoning for each individual query after the table has been extracted and verbalized.
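A minimal sketch of per-query switching, under loud assumptions: the paper's figure captions indicate that the strategy is chosen by prompting a language model, whereas the stand-in below uses a keyword heuristic and a row-count threshold. The names `choose_strategy`, `answer_adaptively`, the cue list, and the threshold of 20 rows are hypothetical.

```python
import re
from typing import Callable, Sequence, Tuple

# Hypothetical cue list: phrases that often signal arithmetic or counting.
_CALC_CUES = re.compile(
    r"\b(?:how many|total|sum|average|difference|more than|less than|most|least)\b"
)


def choose_strategy(question: str, n_rows: int, max_text_rows: int = 20) -> str:
    """Pick 'symbolic' for calculation-like questions or large tables."""
    needs_calc = _CALC_CUES.search(question.lower()) is not None
    if needs_calc or n_rows > max_text_rows:
        return "symbolic"
    return "textual"


def answer_adaptively(
    question: str,
    rows: Sequence[tuple],
    textual_fn: Callable[[str, Sequence[tuple]], object],
    symbolic_fn: Callable[[str, Sequence[tuple]], object],
) -> Tuple[object, str]:
    """Route the query, falling back to textual reasoning if code fails."""
    if choose_strategy(question, len(rows)) == "symbolic":
        try:
            return symbolic_fn(question, rows), "symbolic"
        except Exception:
            # Generated programs can crash; textual reasoning is the fallback.
            return textual_fn(question, rows), "textual"
    return textual_fn(question, rows), "textual"
```

Returning the chosen path alongside the answer makes it easy to log how often each mode is used, which is useful when checking whether the switch itself contributes to accuracy.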

If this is right

  • Language models can reach higher accuracy on table question answering benchmarks such as WikiTQ while using smaller base models like GPT-4o-mini.
  • Numerical inaccuracies that arise during textual reasoning are reduced when the system can fall back to symbolic methods for the same query.
  • Semantic gaps in raw tables are filled by first converting extracted cells into enriched natural-language descriptions.
  • The same extraction-plus-verbalization pipeline can be reused across different language models without task-specific fine-tuning.
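The verbalization point in the list above can be illustrated with a template-based stand-in. The paper uses a language model to write enriched descriptions; the function below only turns cells into plain sentences, so it shows the shape of the step rather than its quality. The name `verbalize_rows` and the sentence template are assumptions.

```python
from typing import Sequence


def verbalize_rows(
    header: Sequence[str], rows: Sequence[tuple], caption: str = ""
) -> str:
    """Render each row as a sentence, skipping empty or missing cells."""
    sentences = []
    if caption:
        sentences.append(f"This table is about {caption}.")
    for index, row in enumerate(rows, start=1):
        parts = [
            f"{column} is {value}"
            for column, value in zip(header, row)
            if value not in (None, "")
        ]
        sentences.append(f"Row {index}: " + "; ".join(parts) + ".")
    return " ".join(sentences)


# Toy data, invented for this example.
text = verbalize_rows(
    ["Team", "Points"], [("A", 10), ("B", None)], caption="league standings"
)
```

In a fuller pipeline this text would be passed to the model together with the extracted table, so the reasoning step has both the structured cells and a natural-language reading of them.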

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the four challenges turn out to be shared across other structured-data formats, the same extraction-verbalization-adaptive sequence could be tested on knowledge graphs or database schemas.
  • The reported gains with a compact model suggest the method may lower the compute cost of reliable table reasoning in production settings.
  • Re-running the pipeline on table datasets that emphasize different error types, such as those with heavy missing values, would test whether the four challenges remain the dominant ones.

Load-bearing premise

That the four listed challenges are the main obstacles to table understanding, and that extraction, verbalization, and adaptive reasoning are enough to remove them.

What would settle it

An ablation study on WikiTQ in which removing the adaptive reasoning step leaves accuracy at or above 78.13 percent would show that the switching mechanism is not required for the reported gains.
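As a rough sketch of how such an ablation could be scored, the snippet below compares exact-match accuracy for a full run and an ablated run on the same questions. The helper names and the normalization (strip and lowercase) are assumptions; benchmark evaluators typically apply more elaborate answer normalization, and the toy predictions are invented.

```python
from typing import Sequence


def _normalize(answer: str) -> str:
    """Simplified normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy after simplified normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(_normalize(p) == _normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold)


def ablation_delta(
    full_preds: Sequence[str], ablated_preds: Sequence[str], gold: Sequence[str]
) -> float:
    """Accuracy of the full system minus accuracy of the ablated system."""
    return accuracy(full_preds, gold) - accuracy(ablated_preds, gold)


# Toy predictions, invented for this example.
gold = ["a", "b", "c", "d"]
full = ["A", "b", "c", "x"]
ablated = ["a", "x", "c", "x"]
delta = ablation_delta(full, ablated, gold)
```

A delta at or below zero on the real benchmark would correspond to the outcome described above, where removing the switching step costs nothing.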

Figures

Figures reproduced from arXiv: 2501.19378 by Hanbing Liu, Lang Cao.

Figure 1: Overview of the challenges and proposed solutions in this work. Tabular data is inherently …
Figure 2: Experimental analysis of challenges in table understanding with language models. (a) …
Figure 3: The framework of TableMaster. It comprises three stages: (1) table structure understanding, where the table’s structure is analyzed, and a table-of-focus is constructed through row and column lookup; (2) table content understanding, where the table-of-focus is reconstructed based on the question, and its information is verbalized to enhance the semantic context; and (3) table reasoning for question answeri…
Figure 4: An example (fetaqa-164) from the FetaQA dataset where the result is accurate, but the …
Figure 5: Performance Comparison Across Table Sizes (Row Count, Column Count, Area Size, …
Figure 6: The row count distribution in the WikiTQ dataset and the analysis of accuracy variation …
Figure 7: Changes in Table Condensation After Table-of-Focus Construction in Table Structure …
Figure 8: Case study on the impact of table verbalization. The data is from the WikiTQ dataset.
Figure 9: Case study of TableMaster. The data is from the WikiTQ dataset.
Figure 10: Prompt for structure extraction in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model in extracting the table’s structure.
Figure 11: Prompt for column ranking in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model to rank the priority of all columns based on the given table, top headers, and related question.
Figure 12: Prompt for column lookup in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model to select relevant columns based on the given table, top headers, and related question.
Figure 13: Prompt for SQL generation for row lookup in …
Figure 14: Prompt for table verbalization in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model to verbalize the given table by adding detailed descriptions and additional knowledge about the table.
Figure 15: Prompt for information estimation in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model to evaluate the given table’s content and determine whether it contains sufficient information to answer the provided question.
Figure 16: Prompt for reasoning strategy assessment in …
Figure 17: Prompt for textual reasoning in TableMaster. Blue text represents placeholders for variables within the prompt, while the grey region indicates optional sections to adapt the prompt for question-answering or fact-verification tasks. The prompt guides the language model to answer the question step by step.
Figure 18: Prompt for textual guidance generation in …
Figure 19: Prompt for symbolic reasoning in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model to generate Python code to answer the question.
Figure 20: Prompt for answer formatting in TableMaster. Blue text indicates placeholders for variables within the prompt. The prompt guides the language model to format the final answer based on the given table, question, and reasoning process.
Figure 21: Direct prompt for table understanding in analysis experiment. Blue text indicates …
Figure 22: Chain of thought prompt for table understanding in analysis experiment. Blue text …
Figure 23: Program of thought prompt for table understanding in analysis experiment. Blue text …
Figure 24: Prompt for table verbalization in analysis experiment. Blue text indicates placeholders …
Figure 25: Prompt for textual guidance generation in analysis experiment. Blue text indicates …
Figure 26: Prompt for reasoning strategy evaluation in analysis experiment. Blue text indicates …
Figure 27: Prompt for reasoning strategy evaluation in analysis experiment. Blue text indicates …
Figure 28: Prompt for classifying a question type based on whether calculation is required in the …
Figure 29: Prompt for generating noised tables in the analysis experiment. Blue text represents …
Original abstract

Tables serve as a fundamental format for representing structured relational data. While current language models (LMs) excel at many text-based tasks, they still face challenges in table understanding due to the complex characteristics of tabular data, such as their structured nature. In this paper, we aim to enhance LMs for improved table understanding. We identify four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address these issues, we propose TableMaster, a recipe and comprehensive framework that integrates multiple solutions to overcome these obstacles. TableMaster first extracts relevant table content and verbalizes it with enriched semantic context. Additionally, we introduce adaptive reasoning, a flexible approach that dynamically adjusts between textual and symbolic reasoning, tailoring the reasoning process to each query. Extensive analyses and experiments demonstrate our findings and the effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines. We hope this work will serve as a practical step toward more robust and reliable table understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper identifies four challenges in LM-based table understanding (locating target data, deficient table semantics, numerical inaccuracies in textual reasoning, and semantic inflexibility in symbolic reasoning) and proposes TableMaster, a prompting recipe that extracts relevant table content, verbalizes it with enriched semantics, and applies adaptive reasoning that dynamically switches between textual and symbolic modes. It reports an accuracy of 78.13% on WikiTQ using GPT-4o-mini, claiming this surpasses existing baselines, and positions the work as a practical framework for more robust table understanding.

Significance. If the empirical gains are shown to be robust through controlled experiments, the work could supply a reusable prompting recipe that improves LM handling of structured data, with potential utility for downstream applications in data analysis and question answering. The contribution is primarily empirical rather than theoretical, and its significance hinges on whether the reported accuracy reflects genuine advances rather than unaccounted implementation details.

major comments (2)
  1. [Abstract] The central performance claim (78.13% on WikiTQ with GPT-4o-mini, surpassing baselines) is presented without any baseline scores, ablation results, statistical significance tests, error bars, or implementation specifics for adaptive reasoning, making it impossible to evaluate whether the data support the claimed improvement.
  2. [Abstract] The assumption that the four listed challenges are the primary barriers, and that extraction, verbalization, and adaptive reasoning are sufficient to overcome them, is stated without supporting analysis, references to prior work quantifying these challenges, or discussion of potential confounding factors in model behavior or evaluation protocols.
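One way the requested error bars could be produced is a percentile bootstrap over per-question correctness. The sketch below is a generic statistical recipe, not something the paper or the referee specifies; the function name, the 2000 resamples, the 95% level, and the toy flags are all assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    correct_flags: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct_flags` holds 1 for a correctly answered question and 0 otherwise.
    """
    n = len(correct_flags)
    if n == 0:
        raise ValueError("correct_flags must not be empty")
    rng = random.Random(seed)
    flags = list(correct_flags)
    resampled = []
    for _ in range(n_resamples):
        # Resample questions with replacement and recompute accuracy.
        sample = rng.choices(flags, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(flags) / n, lower, upper


# Toy flags, invented for this example: 78 correct out of 100.
point, low, high = bootstrap_accuracy_ci([1] * 78 + [0] * 22)
```

Reporting the interval next to each system's accuracy would let a reader judge whether a gap over a baseline is larger than resampling noise; a paired comparison on the same questions would be a stricter check.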

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues in the abstract that affect evaluability of our claims. We address both comments below and will revise the abstract accordingly while preserving its conciseness. The main paper already contains the supporting details referenced in our responses.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim (78.13% on WikiTQ with GPT-4o-mini, surpassing baselines) is presented without any baseline scores, ablation results, statistical significance tests, error bars, or implementation specifics for adaptive reasoning, making it impossible to evaluate whether the data support the claimed improvement.

    Authors: We agree the abstract should enable direct evaluation of the performance claim. In revision we will insert the strongest baseline scores from the main results table (e.g., the best prior GPT-4o-mini result) and explicitly state that ablations, statistical significance tests, error bars, and adaptive-reasoning implementation details appear in Sections 4 and 5. Space constraints preclude full error bars in the abstract, but we will reference their presence in the body. revision: yes

  2. Referee: [Abstract] The assumption that the four listed challenges are the primary barriers, and that extraction, verbalization, and adaptive reasoning are sufficient to overcome them, is stated without supporting analysis, references to prior work quantifying these challenges, or discussion of potential confounding factors in model behavior or evaluation protocols.

    Authors: The four challenges are synthesized from prior empirical studies on table reasoning failures; we will add two concise citations in the revised abstract to the works that first quantified numerical inaccuracies in textual reasoning and semantic rigidity in symbolic methods. A dedicated analysis section in the full paper examines confounding factors (model scale, prompt sensitivity, evaluation protocol) with controlled experiments. We will also insert a short clause noting that the sufficiency claim is supported by the ablation study rather than asserted a priori. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim is an empirical performance result (78.13% accuracy on WikiTQ) obtained via a prompting recipe that extracts table content, verbalizes it, and applies adaptive reasoning. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The four challenges are stated as observations and addressed through explicit procedural steps whose effectiveness is validated externally on benchmark data rather than by construction or self-reference. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described or implied in the abstract.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  2. Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    SpreadsheetAgent uses incremental multi-format reading, structural sketching, and verification to raise spreadsheet benchmark accuracy from 35.27% to 38.16%.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 2 Pith papers

  1. [1]

    Process mining in healthcare: a systematised literature review

    Mahdi Ghasemi and Daniel Amyot. Process mining in healthcare: a systematised literature review. International Journal of Electronic Healthcare, 9(1):60–88, 2016

  2. [2]

    Gfte: Graph-based financial table extraction, 2020

    Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction, 2020

  3. [3]

    Textbooks are all you need, 2023

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tau- man Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023

  4. [4]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024

  5. [5]

    Llama: Open and efficient foundation language models, 2023

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

  6. [6]

    Large language models: A survey, 2024

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2024

  7. [7]

    Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site Li, Yuan Zhang, Hong Yu, and Bo-Hsiang Tseng. Can large language models understand context? In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 2004–2018, St. Julian’s, Malta, March 2024. Associat...

  8. [8]

    Reasoning with large language models, a survey, 2024

    Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. Reasoning with large language models, a survey, 2024

  9. [9]

    Large language models(llms) on tabular data: Prediction, generation, and understanding – a survey, 2024

    Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. Large language models(llms) on tabular data: Prediction, generation, and understanding – a survey, 2024

  10. [10]

    A survey of table reasoning with large language models, 2024

    Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, and Wanxiang Che. A survey of table reasoning with large language models, 2024

  11. [11]

    Compositional semantic parsing on semi-structured tables

    Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. In Chengqing Zong and Michael Strube, editors, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China, July

  12. [13]

    Tabfact: A large-scale dataset for table-based fact verification, 2020

    Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification, 2020

  13. [14]

    Chain-of-thought prompting elicits reasoning in large language models, 2023

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

  14. [15]

    Griffiths, Yuan Cao, and Karthik Narasimhan

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

  15. [16]

    Large language models are few(1)-shot table reasoners

    Wenhu Chen. Large language models are few(1)-shot table reasoners. In Andreas Vlachos and Isabelle Augenstein, editors, Findings of the Association for Computational Linguistics: EACL 2023, pages 1120–1130, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. 10

  16. [17]

    Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning, 2023

    Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning, 2023

  17. [18]

    Chain- of-table: Evolving tables in the reasoning chain for table understanding, 2024

    Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. Chain- of-table: Evolving tables in the reasoning chain for table understanding, 2024

  18. [19]

    Smith, and Tao Yu

    Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Binding language models in symbolic languages, 2023

  19. [20]

    Wang, and Xi Victoria Lin

    Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen tau Yih, Sida I. Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution, 2023

  20. [21]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022

  21. [22]

    Le, Ed H

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challeng- ing big-bench tasks and whether chain-of-thought can solve them, 2022

  22. [23]

    Evaluating open-domain question answering in the era of large language models

    Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei. Evaluating open-domain question answering in the era of large language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada, July

  23. [25]

    Foun- dation models for decision making: Problems, methods, and opportunities, 2023

    Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuurmans. Foun- dation models for decision making: Problems, methods, and opportunities, 2023

  24. [26]

    Large language models for mathematical reasoning: Progresses and challenges

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Neele Falk, Sara Papi, and Mike Zhang, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 225–237, St. Ju...

  25. [27]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  26. [28]

    Least-to-most prompting enables complex reasoning in large language models, 2023

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models, 2023

  27. [29]

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023

  28. [30]

    Self-consistency improves chain of thought reasoning in language models, 2023

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023

  29. [31]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceed- ings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024

  30. [32]

    GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach

    Lang Cao. GraphReason: Enhancing reasoning capabilities of large language models through a graph-based verification approach. In Bhavana Dalvi Mishra, Greg Durrett, Peter Jansen, Ben Lipkin, Danilo Neves Ribeiro, Lionel Wong, Xi Ye, and Wenting Zhao, editors,Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL 20...

  31. [33]

    Let’s verify step by step, 2023

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023

  32. [34]

    Solving math word problems with process- and outcome-based feedback, 2022

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022

  33. [35]

    Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

  34. [36]

    K., M¨uller, T., Piccinno, F., and Eisen- schlos, J

    Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, rancesco Piccinno, and Julian Martin Eisenschlos. TAPAS: weakly supervised table parsing via pre-training. CoRR, abs/2004.02349, 2020

  35. [37]

    Pasta: Table- operations aware fact verification via sentence-table cloze pre-training, 2022

    Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, and Xiaoyong Du. Pasta: Table- operations aware fact verification via sentence-table cloze pre-training, 2022

  36. [38]

    Tuta: Tree-based transformers for generally structured table pre-training

    Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. Tuta: Tree-based transformers for generally structured table pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1780–1790, 2021

  37. [39]

    Tapex: Table pre-training via learning a neural sql executor, 2022

    Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. Tapex: Table pre-training via learning a neural sql executor, 2022

  38. [40]

    Tablellama: Towards open large generalist models for tables, 2024

    Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. Tablellama: Towards open large generalist models for tables, 2024

  39. [41]

    Tablegpt: Towards unifying tables, nature language and commands into one gpt, 2023

    Liangyu Zha, Junlin Zhou, Liyao Li, Rui Wang, Qingyi Huang, Saisai Yang, Jing Yuan, Changbao Su, Xiang Li, Aofeng Su, Tao Zhang, Chen Zhou, Kaizhe Shou, Miao Wang, Wufang Zhu, Guoshan Lu, Chao Ye, Yali Ye, Wentao Ye, Yiming Zhang, Xinglong Deng, Jie Xu, Haobo Wang, Gang Chen, and Junbo Zhao. Tablegpt: Towards unifying tables, nature language and commands ...

  40. [42]

    Huang, Jie Fu, Xiang Yue, and Wenhu Chen

    Alex Zhuang, Ge Zhang, Tianyu Zheng, Xinrun Du, Junjie Wang, Weiming Ren, Stephen W. Huang, Jie Fu, Xiang Yue, and Wenhu Chen. Structlm: Towards building generalist models for structured knowledge grounding, 2024

  41. [43]

    Potable: Program- ming standardly on table-based reasoning like a human analyst, 2024

    Qingyang Mao, Qi Liu, Zhi Li, Mingyue Cheng, Zheng Zhang, and Rui Li. Potable: Program- ming standardly on table-based reasoning like a human analyst, 2024

  42. [44]

    TabSQLify: Enhancing reasoning capabilities of LLMs through table decomposition

    Md Nahid and Davood Rafiei. TabSQLify: Enhancing reasoning capabilities of LLMs through table decomposition. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages 5725–5737, Mexic...

  43. [45]

    Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, and Jignesh M. Patel. Reactable: Enhancing react for table question answering, 2023

  44. [46]

    TAP4LLM: Table provider on sampling, augmenting, and packing semi-structured data for large language model reasoning

    Yuan Sui, Jiaru Zou, Mengyu Zhou, Xinyi He, Lun Du, Shi Han, and Dongmei Zhang. TAP4LLM: Table provider on sampling, augmenting, and packing semi-structured data for large language model reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10306–10323, Miami,...

  45. [47]

    Tree-of-table: Unleashing the power of llms for enhanced large-scale table understanding, 2024

    Deyi Ji, Lanyun Zhu, Siqi Gao, Peng Xu, Hongtao Lu, Jieping Ye, and Feng Zhao. Tree-of-table: Unleashing the power of llms for enhanced large-scale table understanding, 2024

  46. [48]

    Rethinking tabular data understanding with large language models

    Tianyang Liu, Fei Wang, and Muhao Chen. Rethinking tabular data understanding with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 450–482, Mexico City...

  47. [49]

    Encoding spreadsheets for large language models

    Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Junyu Xiong, Mengyu Zhou, Yun Lin, José Cambronero, Yeye He, Shi Han, and Dongmei Zhang. Encoding spreadsheets for large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20728–20748, Miami, F...

  48. [50]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, November 2024

  49. [51]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  50. [52]

    Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks, 2022

    Haoyu Dong, Zhoujun Cheng, Xinyi He, Mengyu Zhou, Anda Zhou, Fan Zhou, Ao Liu, Shi Han, and Dongmei Zhang. Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks, 2022

  51. [53]

    Totto: A controlled table-to-text generation dataset, 2020

    Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. Totto: A controlled table-to-text generation dataset, 2020

  52. [54]

    Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, and Faramarz Fekri. Can LLMs reason in the wild with programs? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9806–9829, Miami, Florida, USA, November 2024. Association for Computational Linguistics

  53. [55]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

  54. [56]

    Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism

    Lang Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3628–3646, Miami, Florida, USA, November 2024. Association for Computational Linguistics

  56. [58]

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don’t know? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics

  57. [59]

    Evaluating the text-to-sql capabilities of large language models, 2022

    Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. Evaluating the text-to-sql capabilities of large language models, 2022

  58. [60]

    FeTaQA: Free-form table question answering

    Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, Mutethia Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, and Dragomir Radev. FeTaQA: Free-form table question answering. Transactions of the Association...

  59. [61]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Compu...

  60. [62]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

  61. [63]

    HiTab: A hierarchical table dataset for question answering and natural language generation

    Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. HiTab: A hierarchical table dataset for question answering and natural language generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Lingui...

  62. [64]

    Finqa: A dataset of numerical reasoning over financial data

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. Finqa: A dataset of numerical reasoning over financial data. Proceedings of EMNLP 2021, 2021

  63. [65]

    Medec: A benchmark for medical error detection and correction in clinical notes, 2025

    Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. Medec: A benchmark for medical error detection and correction in clinical notes, 2025

  64. [66]

    NormTab: Improving symbolic reasoning in LLMs through tabular data normalization

    Md Mahadi Hasan Nahid and Davood Rafiei. NormTab: Improving symbolic reasoning in LLMs through tabular data normalization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3569–3585, Miami, Florida, USA, November 2024. Association for Computational Linguistics

  65. [67]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  66. [68]

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Z...

  67. [69]

    Benchmarking large language model capabilities for conditional generation

    Joshua Maynez, Priyanka Agrawal, and Sebastian Gehrmann. Benchmarking large language model capabilities for conditional generation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9194–9213, Toronto, Canada, July 2023. A...

  68. [70]

    MultiCoT: Chain-of-table reasoning with multiple tables

    CYQIQ. MultiCoT: Chain-of-table reasoning with multiple tables. https://github.com/CYQIQ/MultiCoT, 2025. GitHub repository

  69. [71]

    e5: Zero-shot hierarchical table analysis using augmented LLMs via explain, extract, execute, exhibit and extrapolate

    Zhehao Zhang, Yan Gao, and Jian-Guang Lou. e5: Zero-shot hierarchical table analysis using augmented LLMs via explain, extract, execute, exhibit and extrapolate. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

  70. [72]

    Target-data localization

  71. [73]

    Numerical inaccuracy in textual reasoning

  72. [74]

    Semantic rigidity in symbolic reasoning

    For each challenge we propose a dedicated, minimal solution, whereas earlier efforts typically address only one point and overlook the others.

    Efficient subtable extraction and symbolic reasoning. Prior systems rely on elaborate heuristics for subtable extraction, which often lose information. TableMaster instead co...

  73. [75]

    Sylvain Geboers and Roger De Coster lead with three wins each, followed by Joel Robert with one win, while Gaston Rahier has not won any races

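    Code Execution

    | Wins | Points | Team | Country | Rider | Place |
    | --- | --- | --- | --- | --- | --- |
    | 3 | 3066 | Suzuki | Belgium | Sylvain Geboers | 1 |
    | 2 | 2331 | Maico | Germany | Adolf Weil | 2 |
    | 0 | 2052 | Husqvarna | Sweden | Torlief Hansen | 3 |
    | 3 | 1865 | Suzuki | Belgium | Roger De Coster | 4 |
    | 1 | 1730 | Suzuki | Belgium | Joel Robert | 5 |
    | 2 | 1680 | Husqvarna | Finland | Heikki Mikkola | 6 |
    | 0 | 1276 | Maico | Germany | Willy Bauer | 7 |
    | 0 | 1112 | ČZ | Belgium | Gaston Rahier | 8 |
    | 0 | 1110 | Husqvarna | Netherlands | Pierre Karsmakers | 9 |

    0107...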

  74. [76]

    1. Identify the Relevant Column: Locate the “Wins” column (C).
    2. Extract the Wins Data: Retrieve the win values for Belgian riders.
    3. Convert Wins to Numeric Values: Ensure all values are in numeric format.
    4. Sum the Wins: Add up the total number of wins.
    5. Calculate the Total: Perform the addition.
    6. Verify the Calculation: Double-check for accuracy.
    7. Present th...
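The plan above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's generated code: the names `ROWS`, `COUNTRY`, `WINS`, and `total_wins` are hypothetical, the tuple layout is an assumption, and the rows are only those visible in the truncated standings excerpt, so the result covers that excerpt rather than the full table.

```python
# Rows transcribed from the visible part of the standings excerpt.
# The excerpt is truncated, so any later rows are missing here.
ROWS = [
    # (place, rider, country, team, points, wins), cells kept as strings
    ("1", "Sylvain Geboers", "Belgium", "Suzuki", "3066", "3"),
    ("2", "Adolf Weil", "Germany", "Maico", "2331", "2"),
    ("3", "Torlief Hansen", "Sweden", "Husqvarna", "2052", "0"),
    ("4", "Roger De Coster", "Belgium", "Suzuki", "1865", "3"),
    ("5", "Joel Robert", "Belgium", "Suzuki", "1730", "1"),
    ("6", "Heikki Mikkola", "Finland", "Husqvarna", "1680", "2"),
    ("7", "Willy Bauer", "Germany", "Maico", "1276", "0"),
    ("8", "Gaston Rahier", "Belgium", "ČZ", "1112", "0"),
    ("9", "Pierre Karsmakers", "Netherlands", "Husqvarna", "1110", "0"),
]
COUNTRY, WINS = 2, 5  # column positions inside each row tuple (assumed layout)


def total_wins(rows, country):
    """Sum the wins column over the riders from one country."""
    # Steps 1-2: locate the wins column and keep the matching riders' cells.
    cells = [row[WINS] for row in rows if row[COUNTRY] == country]
    # Step 3: convert the string cells to integers.
    values = [int(cell) for cell in cells]
    # Steps 4-5: add the values up.
    total = sum(values)
    # Step 6: cross-check with an independent running sum.
    check = 0
    for value in values:
        check += value
    assert total == check
    return total


if __name__ == "__main__":
    # Step 7: present the result.
    print(total_wins(ROWS, "Belgium"))
```

On the nine visible rows the Belgian total is 3 + 3 + 1 + 0 = 7. Rows cut off in the excerpt could change the figure for the full table.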