Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

Amritansh Maurya; Mohammed Javed; Navjot Singh; Omar Moured

arXiv: 2605.20254 · v1 · pith:IBB76LLPnew · submitted 2026-05-18 · cs.IR · cs.AI · cs.CV · cs.LG

Pith reviewed 2026-05-21 08:09 UTC · model grok-4.3

classification cs.IR · cs.AI · cs.CV · cs.LG

keywords table question answering · structured prompting · table navigation · progressive inference · training-free methods · large language models · table benchmarks

The pith

Two structured prompting methods let large language models navigate tables to answer questions more accurately without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training-free way to improve how large language models handle questions about data stored in tables. It introduces TableGrid Navigation, which uses a repeating three-part loop to scan rows and columns, gather supporting cells, and update the answer. It also introduces Progressive Inference Prompting, which first requires the model to name the relevant columns and then move through rows in order. These techniques are tested on two standard table datasets against several existing methods and show measurable gains. A sympathetic reader would care because the approach keeps the original model weights fixed while adding clear control over how the model examines the table structure.

Core claim

The authors propose TableGrid Navigation (TGN), an iterative three-module loop that moves across rows and columns to locate evidence cells and refine answers, together with Progressive Inference Prompting (PIP), which first identifies columns and then applies an explicit progressive row-selection constraint based on the query. These prompting frameworks produce higher accuracy than prior baselines on TableBench and FeTaQa while remaining training-free.

What carries the argument

TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP), which impose an iterative navigation loop and a column-first progressive selection constraint to guide the model's interaction with table cells.

If this is right

TGN raises accuracy on TableBench by 3.8 points over the strongest prior baseline.
PIP reaches state-of-the-art results on FeTaQa compared with ReAct and Chain-of-Thought prompting.
The same prompt templates can be used as supervision signals to fine-tune smaller models and shrink the gap to larger ones.
The methods supply a cost-efficient, inference-only route to table question answering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same navigation loop could be adapted to other grid-like data such as spreadsheets or web tables without changing the underlying model.
Because the prompts are explicit, they might serve as a lightweight way to audit or debug the model's reasoning path on individual questions.
Combining the column-first constraint with existing chain-of-thought techniques could produce hybrid prompts that further limit hallucinated cells.

Load-bearing premise

Large language models will reliably follow the iterative navigation loops and column-first progressive selection rules in the prompts without skipping steps or inventing intermediate results.

What would settle it

Run the same questions while logging every intermediate cell or column the model claims to visit; if many final answers cite cells that were never actually named in the logged path, the claim that the structured prompts control retrieval would be weakened.
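The check described above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the trace layout (a `visited` list and a `cited` list of `(row, column)` pairs) and both function names are assumptions made for the example.

```python
from typing import Iterable

# Hypothetical cell address: (row index, column name).
Cell = tuple[int, str]


def ungrounded_cells(visited: Iterable[Cell], cited: Iterable[Cell]) -> set[Cell]:
    """Cells the final answer cites that never appear in the logged path."""
    return set(cited) - set(visited)


def ungrounded_answer_rate(traces: list[dict]) -> float:
    """Fraction of traces whose answer cites at least one unvisited cell."""
    if not traces:
        return 0.0
    flagged = sum(1 for t in traces if ungrounded_cells(t["visited"], t["cited"]))
    return flagged / len(traces)


# Toy traces, made up for illustration only.
traces = [
    {"visited": [(0, "year"), (0, "sales")], "cited": [(0, "sales")]},
    {"visited": [(1, "year")], "cited": [(1, "sales")]},  # cites an unvisited cell
]
rate = ungrounded_answer_rate(traces)
```

A high rate on real traces would suggest that the answers are not being drawn from the cells the prompt structure claims to visit.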

Figures

Figures reproduced from arXiv: 2605.20254 by Amritansh Maurya, Mohammed Javed, Navjot Singh, Omar Moured.

**Figure 1.** Comparison of 4 prompting baselines, (a) Direct Prompting (DP), (b) Tree-of-Thought (ToT), (c) Chain-of-Thought (CoT), and (d) Reason+Act (ReAct), solving a TableBench [24] question. The recent evolution of language models [15] such as GPT, Llama [7], Qwen [27, 17, 26] has transformed natural language tasks, showcasing an extraordinary capacity to handle complex linguistic tasks, generating immense interes…

**Figure 2.** Framework of PIP, TGN and flow diagram of using prompting strategies for inference. The state $S_n \in \mathcal{S}$ at iteration $n$, initialized as $S_0 = \emptyset$, representing the initial state with no prior computations and the state transition function $T_n : \mathcal{S} \times \mathcal{Q} \times \mathcal{T} \to \mathcal{S}$ at iteration $n$, can be defined as: $T_n(S_{n-1}, Q, T) = V_n(E_n(A_n(Q, T, S_{n-1}), T), T)$ (3) Where, the analysis function $A_n(Q, T, S_{n-1}) : \mathcal{Q} \times \mathcal{T} \times \mathcal{S} \to P_n$ generates a …
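The composition in Equation (3) can be read as three functions applied in sequence on each iteration. The sketch below shows only that shape. The source names A_n as the analysis function; the names `extract` and `verify` for E_n and V_n, the toy rules inside each module, and the stopping rule are all assumptions, and a real system would call an LLM where these stubs use plain Python.

```python
# Hypothetical stand-ins for the three modules in Equation (3); the names
# analyze / extract / verify and the stopping rule are assumptions.
Table = list[dict[str, str]]


def analyze(question: str, table: Table, state: dict) -> list[str]:
    """A_n: choose columns to inspect (toy rule: column name occurs in the question)."""
    return [col for col in table[0] if col in question.lower()]


def extract(plan: list[str], table: Table) -> list[tuple[int, str, str]]:
    """E_n: collect (row, column, value) cells for the planned columns."""
    return [(i, col, row[col]) for i, row in enumerate(table) for col in plan]


def verify(evidence: list[tuple[int, str, str]], table: Table) -> dict:
    """V_n: keep only cells that really exist in the table and return the new state."""
    checked = [(i, c, v) for i, c, v in evidence if table[i].get(c) == v]
    return {"evidence": checked}


def transition(state: dict, question: str, table: Table) -> dict:
    """T_n(S_{n-1}, Q, T) = V_n(E_n(A_n(Q, T, S_{n-1}), T), T)."""
    return verify(extract(analyze(question, table, state), table), table)


def run_tgn(question: str, table: Table, max_iters: int = 3) -> dict:
    state: dict = {}  # S_0 is empty
    for _ in range(max_iters):
        new_state = transition(state, question, table)
        if new_state == state:  # assumed stopping rule: state no longer changes
            break
        state = new_state
    return state


# Toy table, made up for illustration only.
table = [{"year": "2020", "sales": "10"}, {"year": "2021", "sales": "12"}]
final_state = run_tgn("What were total sales?", table)
```

Keeping the state as an explicit value makes every intermediate step loggable, which is the property the adherence objections later on this page depend on.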

Original abstract

Large Language Models (LLMs) have shown promising results on NLP tasks; however, their performance on tabular data still needs research attention, because Table Question-Answering (TQA) requires precise cell retrieval and multi-step structured reasoning. Existing work improves TQA either by fine-tuning or training LLMs on task-specific tabular data, but often lacks verifiable control over how the model navigates tables and derives answers. In this work, we propose a training-free TQA approach with two structured prompting frameworks: TableGrid Navigation (TGN), which iteratively navigates rows and columns via a three-module loop to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which enforces column identification followed by an explicit progressive row-selection constraint according to the query. We evaluate 17 LLMs against 6 baselines on the TableBench and FeTaQa datasets. On TableBench, TGN improves over the strongest baseline by 3.8 points, and on FeTaQa, PIP achieves SOTA performance over ReAct and Chain-of-Thought. Beyond inference-time gains, PIP and TGN can also serve as supervision templates to fine-tune small models, narrowing the performance gap to much larger architectures in resource-constrained settings and offering a versatile and cost-efficient solution for TQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Two new prompting loops for table QA deliver small benchmark gains, but the attribution of those gains to the exact prompt structures remains shaky without adherence checks.

The paper puts forward TableGrid Navigation and Progressive Inference Prompting as training-free templates for table question answering. TGN runs a repeated three-module cycle that moves across rows and columns to gather evidence and tighten the answer. PIP forces the model to name relevant columns first, then selects rows step by step according to the question. Both are presented as concrete additions to the existing prompting repertoire rather than new model architectures or training regimes.
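To make the column-first ordering concrete, here is a minimal sketch of how a PIP-style prompt could be assembled. The step wording, the pipe-separated table serialization, and the function names `table_to_text` and `build_pip_prompt` are illustrative assumptions; they are not the prompt text used in the paper.

```python
def table_to_text(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a table as pipe-separated lines (an assumed, simple format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)


def build_pip_prompt(question: str, header: list[str], rows: list[list[str]]) -> str:
    """Compose a prompt that asks for columns first, then rows in order."""
    steps = [
        "Step 1: List the columns needed to answer the question.",
        "Step 2: Go through the rows in order and select those that match the "
        "question, using only the columns from Step 1.",
        "Step 3: Answer using only the selected cells.",
    ]
    parts = [
        "Table:\n" + table_to_text(header, rows),
        "Question: " + question,
        "\n".join(steps),
    ]
    return "\n\n".join(parts)


# Toy table, made up for illustration only.
prompt = build_pip_prompt(
    "Which year had the highest sales?",
    ["year", "sales"],
    [["2020", "10"], ["2021", "12"]],
)
```

The numbered steps give a reviewer fixed anchors to look for in the model's output when checking whether the column step actually preceded row selection.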

Referee Report

2 major / 2 minor

Summary. The manuscript introduces two training-free prompting strategies for Table Question Answering (TQA): TableGrid Navigation (TGN), which employs a three-module iterative loop for row and column navigation to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which mandates column identification prior to progressive row selection based on the query. The authors evaluate these frameworks using 17 large language models on the TableBench and FeTaQa datasets, claiming a 3.8-point improvement over the strongest baseline on TableBench with TGN and state-of-the-art performance with PIP on FeTaQa relative to ReAct and Chain-of-Thought. Additionally, the methods are proposed as templates for fine-tuning smaller models to close the gap with larger architectures.

Significance. The evaluation across 17 LLMs provides broad evidence of applicability and is a strength of the work. If the gains can be shown to arise specifically from the structured navigation and constraints rather than generic prompting effects, the approach offers a practical, cost-efficient inference-time solution for TQA with additional utility as supervision templates for fine-tuning. The central empirical claims would then represent a meaningful incremental advance in controllable table reasoning.

major comments (2)

[Evaluation] Evaluation section: The reported gains (3.8 points on TableBench for TGN; SOTA on FeTaQa for PIP) are presented without details on the exact six baselines, error bars, statistical significance tests, or any post-hoc selection criteria. This absence prevents verification of the central performance claims and their robustness.
[Method] Method section (TGN and PIP descriptions): The attribution of improvements to the three-module iterative navigation loop in TGN and the column-first progressive selection constraint in PIP rests on the assumption that LLMs faithfully execute these structures. No quantitative adherence metrics (e.g., fraction of traces staying within the prescribed module sequence or avoiding hallucinated cells) are provided, leaving open the possibility that gains reflect general LLM reasoning rather than the proposed frameworks.

minor comments (2)

[Abstract] Abstract: The claim of evaluation 'against 6 baselines' would be clearer if the baseline names were listed explicitly.
[Throughout] Notation: Ensure TGN and PIP are defined at first use and used consistently thereafter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes we will make to strengthen the manuscript.

Referee: [Evaluation] Evaluation section: The reported gains (3.8 points on TableBench for TGN; SOTA on FeTaQa for PIP) are presented without details on the exact six baselines, error bars, statistical significance tests, or any post-hoc selection criteria. This absence prevents verification of the central performance claims and their robustness.

Authors: We agree that additional details are needed for full verifiability. The six baselines are ReAct, Chain-of-Thought, standard prompting, and three table-specific variants described in Section 4; we will explicitly list them with one-sentence descriptions in the revised Evaluation section. We will also add error bars (standard deviation over three runs for the main results), report paired statistical significance tests (McNemar’s test) for the 3.8-point gain on TableBench and the SOTA claim on FeTaQa, and state that no post-hoc model or baseline selection was performed beyond reporting the strongest baseline as written. These elements will appear in the main text and a supplementary table. revision: yes
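For readers unfamiliar with the test named in this response, the following is a small standard-library sketch of an exact (binomial) McNemar test on paired per-question correctness, plus a standard deviation over repeated runs. The function name `mcnemar_exact` and all numbers are made up for illustration; none of them come from the paper.

```python
from math import comb
from statistics import stdev


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value from paired correctness flags."""
    # Discordant pairs: only method A correct (b) or only method B correct (c).
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the methods never disagree
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-question outcomes for two prompting methods (illustrative only).
method_a = [True, True, True, True, True, True, False, False]
method_b = [True, True, False, False, False, False, True, False]
p_value = mcnemar_exact(method_a, method_b)

# Error bar as the sample standard deviation over three toy run scores.
run_scores = [1.0, 2.0, 3.0]
spread = stdev(run_scores)
```

The test uses only the questions on which the two methods disagree, so it suits comparisons where both methods are run on the same benchmark items.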
Referee: [Method] Method section (TGN and PIP descriptions): The attribution of improvements to the three-module iterative navigation loop in TGN and the column-first progressive selection constraint in PIP rests on the assumption that LLMs faithfully execute these structures. No quantitative adherence metrics (e.g., fraction of traces staying within the prescribed module sequence or avoiding hallucinated cells) are provided, leaving open the possibility that gains reflect general LLM reasoning rather than the proposed frameworks.

Authors: We acknowledge the value of quantitative adherence evidence. While the paper includes illustrative execution traces showing that models follow the prescribed module sequences and cell references, we did not compute aggregate statistics such as adherence rate or hallucinated-cell frequency across all 17 models. In the revision we will add a short discussion of observed adherence based on manual review of 50 randomly sampled traces per framework, include additional full traces in the appendix, and explicitly note the absence of large-scale quantitative adherence metrics as a limitation. We maintain that the performance gap versus less-constrained baselines (CoT, ReAct) supports the contribution of the specific structures, but agree that direct adherence quantification would further isolate the effect. revision: partial
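An aggregate adherence rate of the kind discussed here could be computed as below. This is a hedged sketch: it assumes each trace has already been parsed into a list of module labels, and the labels `analyze`, `extract`, and `verify` are hypothetical names for the three TGN modules, which this page does not name.

```python
CYCLE = ("analyze", "extract", "verify")  # assumed module labels


def follows_cycle(steps: list[str], cycle: tuple[str, ...] = CYCLE) -> bool:
    """True if steps is one or more complete repetitions of cycle, in order."""
    if not steps or len(steps) % len(cycle) != 0:
        return False
    return all(step == cycle[i % len(cycle)] for i, step in enumerate(steps))


def adherence_rate(traces: list[list[str]], cycle: tuple[str, ...] = CYCLE) -> float:
    """Fraction of traces that stay within the prescribed module sequence."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if follows_cycle(t, cycle)) / len(traces)


# Toy traces, made up for illustration only.
traces = [
    ["analyze", "extract", "verify"],
    ["analyze", "verify"],                      # skipped a module
    ["analyze", "extract", "verify"] * 2,       # two full iterations
    ["extract", "analyze", "verify"],           # wrong order
]
rate = adherence_rate(traces)
```

Reporting this rate per model alongside accuracy would show whether the gains track adherence or appear regardless of it.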

Circularity Check

0 steps flagged

No circularity: empirical prompting evaluation is self-contained

The paper proposes two prompting frameworks (TGN with its three-module iterative navigation loop and PIP with column-first progressive selection) and reports performance gains via direct empirical comparisons on TableBench and FeTaQa against external baselines such as ReAct and Chain-of-Thought. No mathematical derivations, fitted parameters, self-citations forming load-bearing chains, or reductions of results to inputs by construction appear in the described approach. The central claims rest on verifiable benchmark scores rather than any self-referential loop, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach relies on the unstated domain assumption that LLMs possess sufficient instruction-following capability to execute multi-step table navigation without external verification or training.

axioms (1)

domain assumption: Large language models can reliably execute complex iterative navigation and progressive selection instructions in zero-shot prompting.
Central to both TGN and PIP success; invoked implicitly throughout the proposed frameworks.

invented entities (2)

TableGrid Navigation (TGN) · no independent evidence
purpose: Iterative row-column navigation loop for evidence location in tables
New method introduced to structure LLM reasoning over tabular data.

Progressive Inference Prompting (PIP) · no independent evidence
purpose: Column identification followed by constrained row selection
New method introduced to enforce explicit progressive reasoning.

pith-pipeline@v0.9.0 · 5768 in / 1303 out tokens · 38727 ms · 2026-05-21T08:09:34.762781+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 10 internal anchors

[1] Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)

[2] Brown, T., Mann, B., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

[3] Chen, W.: Large language models are few(1)-shot table reasoners. In: Vlachos, A., Augenstein, I. (eds.) Findings of the Association for Computational Linguistics: EACL 2023. pp. 1120–1130. Association for Computational Linguistics, Dubrovnik, Croatia (May 2023). https://doi.org/10.18653/v1/2023.findings-eacl.83, https://aclanthology.org/2023.findings...

[4] Chung, H.W., Hou, L., Longpre, S., et al.: Scaling instruction-finetuned language models. Journal of Machine Learning Research 25(70), 1–53 (2024)

[5] Deng, N., Sun, Z., He, R., Sikka, A., Chen, Y., Ma, L., Zhang, Y., Mihalcea, R.: Tables as texts or images: Evaluating the table reasoning ability of llms and mllms. arXiv preprint arXiv:2402.12424 (2024)

[6] Fang, X., Xu, W., Tan, F.A., Hu, Z., Zhang, J., Qi, Y., Sengamedu, S.H., Faloutsos, C.: Large language models (LLMs) on tabular data: Prediction, generation, and understanding - a survey. Transactions on Machine Learning Research (2024), https://openreview.net/forum?id=IZnrCGF9WI

[7] Grattafiori, A., Dubey, A., Jauhri, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[8] Guo, D., Yang, D., Zhang, H., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[9] Jin, N., Siebert, J., Li, D., Chen, Q.: A survey on table question answering: recent advances. In: China Conference on Knowledge Graph and Semantic Computing. pp. 174–186. Springer (2022)

[10] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022)

[11] Lin, C.Y., Och, F.: Looking for a few good metrics: Rouge and its evaluation. In: Ntcir workshop. pp. 1–8 (2004)

[12] Lo, L.S.: The clear path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship 49(4), 102720 (2023)

[13] Lo, L.S.: The clear path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship 49(4), 102720 (2023). https://doi.org/10.1016/j.acalib.2023.102720, https://www.sciencedirect.com/science/article/pii/S0099133323000599

[14] Nan, L., Hsieh, C., Mao, Z., Lin, X.V., Verma, N., Zhang, R., Kryściński, W., Schoelkopf, H., Kong, R., Tang, X., et al.: Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics 10, 35–49 (2022)

[15] Peters, M.E., Neumann, M., Iyyer, M., et al.: Deep contextualized word representations. arxiv 2018. arXiv preprint arXiv:1802.05365 12 (2018)

[16] Post, M.: A call for clarity in reporting bleu scores. In: Proceedings of the third conference on machine translation: Research papers. pp. 186–191 (2018)

[17] Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., et al.: Qwen2.5 technical report (2025), https://arxiv.org/abs/2412.15115

[18] Schulhoff, S., Ilie, M., Balepur, N., et al.: The prompt report: A systematic survey of prompt engineering techniques (2025), https://arxiv.org/abs/2406.06608

[19] Sellam, T., Das, D., Parikh, A.: Bleurt: Learning robust metrics for text generation. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 7881–7892 (2020)

[20] da Silva, J.W.F., Venceslau, A.D.P., Sales, J.E., et al.: A short survey on end-to-end simple question answering systems. Artificial Intelligence Review 53(7), 5429–5453 (2020)

[21] Su, A., Wang, A., Ye, C., et al.: Tablegpt2: A large multimodal model with tabular data integration (2024), https://arxiv.org/abs/2411.02059

[22] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)

[23] Wei, J., Wang, X., Schuurmans, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

[24] Wu, X., Yang, J., Chai, L., et al.: Tablebench: A comprehensive and complex benchmark for table question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 25497–25506 (2025)

[25] Xu, J., Fei, H., Pan, L., Liu, Q., Lee, M.L., Hsu, W.: Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357 (2024)

[26] Yang, A., Li, A., Yang, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[27] Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., et al.: Qwen2 technical report (2024), https://arxiv.org/abs/2407.10671

[28] Yao, S., Yu, D., Zhao, J., et al.: Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems 36, 11809–11822 (2023)

[29] Yao, S., Zhao, J., Yu, D., et al.: React: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)

[30] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019)

[31] Zhang, X., Luo, S., Zhang, B., Ma, Z., Zhang, J., Li, Y., Li, G., Yao, Z., Xu, K., Zhou, J., et al.: Tablellm: Enabling tabular data manipulation by llms in real office usage scenarios. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 10315–10344 (2025)

[32] Zhang, X., Wang, D., Dou, L., Zhu, Q., Che, W.: A survey of table reasoning with large language models. Frontiers of Computer Science 19(9), 199348 (2025)

[33] Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al.: Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022)

A Prompts

In this section, we present demonstration used across TableBench dataset. We select the same a...
