arxiv: 2509.17680 · v2 · pith:FFPGXXUQnew · submitted 2025-09-22 · 💻 cs.CL

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

Pith reviewed 2026-05-21 21:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords TableQA · denoising · large language models · question decomposition · table pruning · evidence tree · noise handling

The pith

A dual denoising framework improves TableQA by cleaning complex questions and pruning large noisy tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that noise from complex questions and large tables severely degrades LLM reasoning in TableQA, and that the problem can be addressed by denoising the question and the table separately. It decomposes each question into minimal semantic units, then drops those that fail consistency or usability checks for answer reasoning. It then builds an evidence tree to guide step-by-step table pruning, using a post-order rollback step to correct abnormal intermediate table states before final answer generation. A sympathetic reader would care because real-world tables are large and questions are intricate, so a reliable way to retain only the relevant evidence while shrinking the input helps LLMs answer correctly without being overwhelmed by irrelevant data.

Core claim

EnoTab performs evidence-based question denoising by breaking the input question into minimal semantic units and retaining only those that meet consistency and usability criteria for answer reasoning, then applies evidence tree-guided table denoising that constructs an explicit pruning path, observes each intermediate table state, and invokes a post-order node rollback mechanism whenever an abnormal state appears, ultimately yielding a compact reliable sub-table for final reasoning.
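The question-denoising half of this claim can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the `Unit` type, the `denoise_question` function, the column-membership test standing in for the usability check, and the caller-supplied `is_consistent` predicate. The paper's actual criteria are not spelled out on this page and are presumably LLM-driven. The first two units paraphrase a question quoted in the figure excerpts; the third is an invented distractor.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Unit:
    """One minimal semantic unit of a question (hypothetical representation)."""
    text: str
    column: str  # table column the unit refers to, "" if none was identified


def denoise_question(
    units: List[Unit],
    table_columns: List[str],
    is_consistent: Callable[[Unit], bool],
) -> List[Unit]:
    """Keep only the units that pass both the usability and consistency checks."""
    kept = []
    for unit in units:
        # Stand-in usability check (assumption): the unit must map to a real column.
        usable = unit.column in table_columns
        if usable and is_consistent(unit):
            kept.append(unit)
    return kept


units = [
    Unit("district is Tel Aviv or Center", "District"),
    Unit("population under 65,000", "Population"),
    Unit("according to the latest census report", ""),  # invented distractor
]
# The consistency check is stubbed to always pass in this toy run.
kept = denoise_question(units, ["District", "Population", "Area"], lambda u: True)
print([u.text for u in kept])
```

The design point the sketch tries to show is that filtering happens per unit, so a distracting clause can be dropped without touching the clauses that carry the answer conditions.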

What carries the argument

Evidence Tree-guided Table Denoising with post-order node rollback, which builds an explicit pruning path and corrects abnormal intermediate table states step by step.

If this is right

  • Question decomposition into semantic units can isolate and discard only the parts that do not contribute to answer reasoning.
  • An evidence tree supplies a transparent, step-wise path for removing irrelevant rows and columns while preserving essential content.
  • Rollback on abnormal intermediate table states prevents error propagation during pruning.
  • The resulting compact sub-table allows the LLM to reason more reliably without being distracted by noise.
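The pruning and rollback behaviour listed above can be sketched as a post-order walk over a small tree. This is a reading of the mechanism, not the authors' implementation: the `Node` class, the `prune` function, and the lambda predicates are hypothetical, and the fallback follows the And2Or description quoted in the figure excerpts on this page (an AND node that yields an empty sub-table falls back to OR). The rows reuse district and population values that appear in those excerpts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, object]


@dataclass
class Node:
    """Evidence-tree node (hypothetical): a leaf predicate or an AND/OR operator."""
    op: str = "LEAF"  # "LEAF", "AND", or "OR"
    pred: Optional[Callable[[Row], bool]] = None
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, rows: List[Row]) -> Set[int]:
    """Post-order evaluation: children first, then the node itself.

    Returns the indices of the rows kept by this subtree. Internal nodes
    are assumed to have at least one child.
    """
    if node.op == "LEAF":
        return {i for i, row in enumerate(rows) if node.pred(row)}
    child_sets = [prune(child, rows) for child in node.children]
    if node.op == "AND":
        kept = set.intersection(*child_sets)
        if not kept:
            # Abnormal intermediate state (empty sub-table): And2Or-style fallback.
            kept = set.union(*child_sets)
        return kept
    return set.union(*child_sets)


def leaf(pred: Callable[[Row], bool]) -> Node:
    return Node(pred=pred)


rows = [
    {"District": "North", "Population": 46300},
    {"District": "Center", "Population": 36300},
    {"District": "Tel Aviv", "Population": 40600},
    {"District": "Tel Aviv", "Population": 34400},
]

in_tel_aviv = leaf(lambda r: r["District"] == "Tel Aviv")
in_center = leaf(lambda r: r["District"] == "Center")
small = leaf(lambda r: r["Population"] < 65000)

# (District is Tel Aviv OR Center) AND Population < 65000
tree = Node("AND", children=[Node("OR", children=[in_tel_aviv, in_center]), small])
print(sorted(prune(tree, rows)))

# A mis-built AND over two exclusive conditions would empty the table,
# so the fallback keeps the union instead.
conflict = Node("AND", children=[in_center, in_tel_aviv])
print(sorted(prune(conflict, rows)))
```

Checking the result at each internal node, rather than only at the root, is what keeps an over-strict conjunction from silently discarding every candidate row before the final reasoning step.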

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-denoising pattern could be tested on other structured-data tasks such as knowledge-base question answering or spreadsheet formula generation.
  • Rollback could be generalized to other tree-guided search procedures that risk entering invalid states.
  • If the semantic-unit decomposition proves robust, it might serve as a lightweight pre-processing step for any LLM pipeline that receives long or noisy inputs.

Load-bearing premise

Decomposing questions into minimal semantic units and filtering them with consistency and usability criteria will keep every piece of information needed for correct reasoning while removing noise.

What would settle it

Run the method on a benchmark dataset of complex questions over large tables and check whether the final answers are less accurate than those produced by the same LLM on the original un-denoised inputs.
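As a rough sketch of that check, the comparison reduces to scoring two prediction lists against the same gold answers. The function names and the whitespace/case normalization are assumptions, and the answer strings below are toy placeholders, not results from the paper.

```python
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.strip().lower().split())


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold answer after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


def denoising_hurts(raw_preds: List[str], denoised_preds: List[str], gold: List[str]) -> bool:
    """True if answers on denoised inputs are less accurate than on the original inputs."""
    return accuracy(denoised_preds, gold) < accuracy(raw_preds, gold)


# Toy placeholder values only.
gold = ["4", "tel aviv"]
raw_preds = ["3", "Tel Aviv"]
denoised_preds = ["4", " tel aviv "]
print(accuracy(raw_preds, gold), accuracy(denoised_preds, gold))
print(denoising_hurts(raw_preds, denoised_preds, gold))
```

Both runs should use the same LLM and the same questions, so that any difference in the two scores is attributable to the denoising steps.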

Figures

Figures reproduced from arXiv: 2509.17680 by Dong Jin, Jian Yang, Shenghao Ye, Shuangwu Chen, Xiaofeng Jiang, Yikai Shen, Yu Guo, Yunpeng Hou.

Figure 1: Error and Correct Cases for (a) the Complex …
Figure 2: The EnoTab framework, composed of Evidence-based Question Denoising to remove irrelevant semantic …
Figure 3: Distribution of question difficulty levels in the …
Figure 5: Execution accuracy of closed-source and open …
Figure 6: Usability Assessment of Evidence.
    A.5 Rollback Mechanism: Analysis of And2Or. The rollback mechanism is designed to prevent the loss of essential information during table pruning. In particular, the And2Or operation acts as a fallback strategy when an internal node with an AND operator produces an empty subtable. This situation indicates that the system has reached the limits of its discriminative capabili…
Figure 7: An illustrative Evidence Tree with four leaf …
Figure 9: Statistics for the SQL execution errors.
Figure 10: Execution accuracy on WikiTQ with noisy content in tables.
    B.3 Noisy Content. To further evaluate EnoTab’s robustness to noisy data, we follow the approach proposed in Binder (Cheng et al., 2023) and construct a noisy version of the WikiTQ development set by injecting distractive content to simulate misleading information in real-world scenarios. We evaluate both EnoTab and a standard End-to-End QA method…
Figure 11: Prompt of Cell-based QA Generation.
Figure 12: Prompt of Row-based QA Generation
    ### Instruction: You are a QA generation expert. Your task is to use selected row above to generate one question using the information within this row. Return the question and its answer. Repeat this process 4 times.
    ### Table: Header: [HEADER] Content: [TAB] Selected Row: [ROW]
    ### Note: -Directly returns questions in format "Q: questi…
Figure 13: Prompt of Column-based QA Generation.
Figure 14: Prompt of Subtable-based QA Generation
    ### Instruction: You are a QA generation expert. Your task is to use sub-table above to generate one question using the information within this sub-table. Return the question and its answer. Repeat this process 4 times.
    ### Table: Header: [HEADER] Content: [TAB] Selected Subtable: [SUB]
    ### Note: -Directly returns questions in for…
Figure 15: Prompt of Table expansion with noise.
Figure 16: Prompt of Evidence Generation
    ### Instruction: You are a task requirement understanding expert. Your task is to extract multiple pieces of evidence from the provided table and the question.
    ### Table:
    col : Common name | District | Hebrew | Arabic | Population\n(2009) | Area\n(km²) | Mayor
    row 1 : Acre | North | וכע | اكع | 46,300 | 13.533 | Shimon Lancry
    row 2 : Aful…
Figure 17: Prompt of Tree Construction.
Figure 18: Prompt of Final Query
    ### Instruction: You are a table question answering expert. Your task is to infer the answer to the question based on the provided table.
    ### Table:
    col : District | Population\n(2009) | Area\n(km²)
    row 1 : Center | 36300 | 2.756
    row 2 : Tel Aviv | 40600 | 16.792
    row 3 : Tel Aviv | 34400 | 5.141
    row 4 : Tel Aviv | 31000 | 4.112
    ### Question: How many citi…
Figure 19: Prompt of Table Verifier.
Figure 20: Example of Textural Evidence
    ### Question1: how many cities in Tel Aviv or Center have under 65,000 people?
    ### Evidence1: area [colmn2:District] action [textural] condition [Center in x]
    ### Question2: how many competitions were not in the united kingdom?
    ### Evidence2: area [colmn4:Country] action [textural] condition [United Kingdom not in x]
    ### Question3: what is the total number of fw…
Figure 21: Example of Numerical Evidence.
Figure 22: Example of Date Evidence
    ### Question1: how many games were played after october 1st?
    ### Evidence1: area [colmn1:Date] action [date] condition [x > 2008-10-01]
    ### Question2: what were the total number of deaths in 2003?
    ### Evidence2: area [colmn1:Year] action [date] condition [x == 2003-01-01]
    ### Question3: is the price money for 23 january 1984 more than that of 23 april 1984?
    ##…
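The evidence strings quoted in these excerpts share one shape, for example `area [colmn1:Date] action [date] condition [x > 2008-10-01]`. A minimal sketch of reading that shape into fields follows. The `Evidence` tuple, the `parse_evidence` function, and the regular expression are assumptions inferred from the quoted examples alone; the paper's own parsing code is not shown on this page.

```python
import re
from typing import NamedTuple


class Evidence(NamedTuple):
    """Fields of one evidence string (hypothetical structure)."""
    column_index: int
    column_name: str
    action: str      # e.g. "textural" or "date", as spelled in the examples
    condition: str   # condition over x, kept as raw text


# Accepts the "colmn" spelling used in the quoted prompts as well as "column".
_EVIDENCE = re.compile(
    r"area \[colu?mn(\d+):([^\]]+)\]\s+action \[([^\]]+)\]\s+condition \[([^\]]+)\]"
)


def parse_evidence(line: str) -> Evidence:
    """Parse one evidence string; raise ValueError if it does not match the shape."""
    match = _EVIDENCE.search(line)
    if match is None:
        raise ValueError(f"not an evidence string: {line!r}")
    index, name, action, condition = match.groups()
    return Evidence(int(index), name.strip(), action.strip(), condition.strip())


ev = parse_evidence("area [colmn1:Date] action [date] condition [x > 2008-10-01]")
print(ev)
```

Keeping the condition as raw text is deliberate in this sketch: evaluating it safely would depend on the action type, which the excerpts only partly specify.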
Figure 23: Error Case of ParserException.
Figure 24: Error Case of BinderException
    ### Question: In cycle 4 of Austria's Next Top Model, how many contestants were older than 20?
    ### Table:
    col : Contestant | Age | Height | Home City | Rank
    row 1 : Alina Chlebecek | 18 | 170cm | Vienna | Eliminated in Episode 1
    row 2 : Isabelle Raisa | 16 | 170cm | Vienna | Eliminated in Episode 1
    row 3 : Sabrina Angelika Rauch | 21| 170cm | Vi…
Original abstract

Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EnoTab, a dual denoising framework for TableQA on complex questions and large-scale tables. It first applies Evidence-based Question Denoising by decomposing the input question into minimal semantic units and filtering those deemed irrelevant via consistency and usability criteria. It then performs Evidence Tree-guided Table Denoising, constructing an explicit pruning path and applying a post-order node rollback mechanism to recover from abnormal intermediate table states, ultimately yielding a pruned sub-table for final LLM reasoning. The authors claim that this yields outstanding performance on TableQA tasks.

Significance. If the empirical claims hold, the work addresses a practically important gap in robust TableQA by explicitly targeting relevance filtering and table pruning under noise. The transparent evidence-tree construction and rollback mechanism constitute a concrete, interpretable contribution that could aid debugging and extension in LLM-based reasoning pipelines.

major comments (2)
  1. [§3.1] §3.1 (Evidence-based Question Denoising): The load-bearing assumption that decomposing questions into minimal semantic units and then dropping units via consistency/usability filters will retain every piece of information required for correct downstream reasoning is not obviously true for complex questions. Interdependent clauses that appear unusable in isolation can still be essential when combined; discarding them leaves the evidence tree with an incomplete set that later rollback or final reasoning cannot recover. This needs targeted validation (e.g., ablation on questions with known interdependencies or error analysis of cases where filtering removes critical context).
  2. [§4] §4 (Experiments and Results): The abstract asserts 'outstanding performance' and 'extensive experiments,' yet the strength of the central claim depends on showing that the dual-denoising pipeline, rather than other factors, drives the gains. The manuscript should report concrete metrics, strong baselines (including recent TableQA and denoising methods), dataset statistics, and an error analysis that isolates the contribution of each denoising stage and the rollback mechanism.
minor comments (2)
  1. Define acronyms (TableQA, LLM) on first use and ensure consistent notation for 'evidence tree' versus 'evidence-based' throughout.
  2. [Figure 1] Figure captions and the evidence-tree diagram should explicitly annotate the rollback operation and the intermediate table states being observed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have reviewed each major comment carefully and outline our responses and planned revisions below.

Point-by-point responses
  1. Referee: [§3.1] §3.1 (Evidence-based Question Denoising): The load-bearing assumption that decomposing questions into minimal semantic units and then dropping units via consistency/usability filters will retain every piece of information required for correct downstream reasoning is not obviously true for complex questions. Interdependent clauses that appear unusable in isolation can still be essential when combined; discarding them leaves the evidence tree with an incomplete set that later rollback or final reasoning cannot recover. This needs targeted validation (e.g., ablation on questions with known interdependencies or error analysis of cases where filtering removes critical context).

    Authors: We appreciate this observation on the risks of information loss from interdependencies in complex questions. Our consistency and usability filters are designed to retain evidence necessary for reasoning while removing noise, and the subsequent evidence tree with rollback provides a recovery path. Nevertheless, we agree that explicit validation is warranted. In the revision we will add a targeted ablation on questions containing known interdependent clauses together with an error analysis of filtering cases that remove critical context. revision: yes

  2. Referee: [§4] §4 (Experiments and Results): The abstract asserts 'outstanding performance' and 'extensive experiments,' yet the strength of the central claim depends on showing that the dual-denoising pipeline, rather than other factors, drives the gains. The manuscript should report concrete metrics, strong baselines (including recent TableQA and denoising methods), dataset statistics, and an error analysis that isolates the contribution of each denoising stage and the rollback mechanism.

    Authors: We agree that stronger isolation of the dual-denoising contributions is needed to support the central claims. The current experiments compare against several TableQA baselines on standard benchmarks, yet we will expand §4 in the revision to include additional concrete metrics, more recent TableQA and denoising baselines, fuller dataset statistics, and a dedicated error analysis that quantifies the incremental effect of question denoising, table pruning, and the rollback mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity detected in EnoTab dual-denoising derivation

full rationale

The paper presents EnoTab as an algorithmic framework: question decomposition into semantic units followed by consistency/usability filtering, then evidence-tree table pruning with rollback. These steps are defined procedurally and evaluated empirically on TableQA benchmarks. No equation or claim reduces a 'prediction' to a fitted input by construction, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled in. The performance results are reported from experiments rather than derived tautologically from the method definition itself. The framework's claims are therefore tested against external benchmarks rather than against themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about semantic decomposition and pruning effectiveness; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Decomposing questions into minimal semantic units and filtering by consistency and usability criteria preserves all necessary information for reasoning.
    Invoked as the basis for Evidence-based Question Denoising in the abstract.
  • domain assumption: Step-by-step table pruning guided by an evidence tree with rollback will produce a reliable sub-table without losing essential content.
    Central to Evidence Tree-guided Table Denoising and the post-order node rollback mechanism.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

    cs.CL 2026-04 unverdicted novelty 5.0

    ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to u...

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Nikhil Abhyankar, Vivek Gupta, Dan Roth, and Chandan K Reddy. 2025. H-star: Llm-driven hybrid sql-text adaptive reasoning on tables. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8841--8863

  2. [2]

    Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1):538--549

  3. [3]

    Si-An Chen, Lesly Miculicich, Julian Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, and Tomas Pfister. 2024. Tablerag: Million-token table understanding with language models. Advances in Neural Information Processing Systems, 37:74899--74921

  4. [4]

    Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia

  5. [5]

    Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. 2023. Binding language models in symbolic languages. ICLR, abs/2210.02875

  6. [6]

    Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2022. Turl: Table understanding through representation learning. ACM SIGMOD Record, 51(1):33--40

  7. [7]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  8. [8]

    Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, and Xiaoyong Du. 2022. Pasta: table-operations aware fact verification via sentence-table cloze pre-training. arXiv preprint arXiv:2211.02816

  9. [9]

    Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. Tapas: Weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349

  10. [10]

    Rohit Khoja, Devanshu Gupta, Yanjie Fu, Dan Roth, and Vivek Gupta. 2025. Weaver: Interweaving sql and llm for table reasoning. arXiv preprint arXiv:2505.18961

  11. [11]

    Mike Lewis. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461

  12. [12]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74--81

  13. [13]

    Xin Lin, Zhenya Huang, Zhiqiang Zhang, Jun Zhou, and Enhong Chen. 2025. Explore what llm does not know in complex question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24585--24594

  14. [14]

    Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. 2021. Tapex: Table pre-training via learning a neural sql executor. arXiv preprint arXiv:2107.07653

  15. [15]

    Qingyang Mao, Qi Liu, Zhi Li, Mingyue Cheng, Zheng Zhang, and Rui Li. 2024. Potable: Programming standardly on table-based reasoning like a human analyst. arXiv preprint arXiv:2412.04272

  16. [16]

    Md Nahid and Davood Rafiei. 2024a. Normtab: Improving symbolic reasoning in llms through tabular data normalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3569--3585

  17. [17]

    Md Nahid and Davood Rafiei. 2024b. Tabsqlify: Enhancing reasoning capabilities of llms through table decomposition. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5725--5737

  18. [18]

    Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, et al. 2022. Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35--49

  19. [19]

    OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774

  20. [20]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311--318

  21. [21]

    Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305

  22. [22]

    Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O Arik. 2024. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql. arXiv preprint arXiv:2410.01943

  23. [23]

    Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-sql capabilities of large language models. arXiv preprint arXiv:2204.00498

  24. [24]

    Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets llm: Can large language models understand structured table data? a benchmark and empirical study. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 645--654

  25. [25]

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295

  26. [26]

    Lihan Wang, Bowen Qin, Binyuan Hui, Bowen Li, Min Yang, Bailin Wang, Binhua Li, Jian Sun, Fei Huang, Luo Si, et al. 2022a. Proton: Probing schema linking information from pre-trained language models for text-to-sql parsing. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1889--1898

  27. [27]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

  28. [28]

    Yuxiang Wang, Junhao Gan, and Jianzhong Qi. 2025a. Tabsd: Large free-form table question answering with sql-based table decomposition. arXiv preprint arXiv:2502.13422

  29. [29]

    Yuxiang Wang, Jianzhong Qi, and Junhao Gan. 2025b. Accurate and regret-aware numerical problem solver for tabular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 12775--12783

  30. [30]

    Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024. Chain-of-table: Evolving tables in the reasoning chain for table understanding. ICLR

  31. [31]

    Zirui Wu and Yansong Feng. 2024. Protrix: Building models for planning and reasoning over tables with sentence context. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4378--4406

  32. [32]

    Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval, pages 174--184

  33. [33]

    Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. Tabert: Pretraining for joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314

  34. [34]

    Peiying Yu, Guoxin Chen, and Jingjing Wang. 2025. Table-critic: A multi-agent framework for collaborative criticism and refinement in table reasoning. https://doi.org/10.18653/v1/2025.acl-long.853. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17432--17451, Vienna, Austria. Associ...

  35. [35]

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

  36. [36]

    Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, and Wanxiang Che. 2025. A survey of table reasoning with large language models. Frontiers of Computer Science, 19(9):199348

  37. [37]

    Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, and Jignesh M Patel. 2023. Reactable: Enhancing react for table question answering. arXiv preprint arXiv:2310.00815

  38. [38]

    Yilun Zhao, Lyuhao Chen, Arman Cohan, and Chen Zhao. 2024. Tapera: enhancing faithfulness and interpretability in long-form table qa by content planning and execution-based reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12824--12840

  39. [39]

    Wei Zhou, Mohsen Mesgar, Annemarie Friedrich, and Heike Adel. 2025. Efficient multi-agent collaboration with tool use for online planning in complex table question answering. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 945--968
