CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

Changzai Pan; Chenshuo Pan; Jiayi Liang; Jie Zhang; Shuangyong Song; Yongxiang Li; Yujie Mao; Yu Zhao; Zhenhe Wu; Zhongjiang He

arxiv: 2606.06842 · v1 · pith:WWKQWFAFnew · submitted 2026-06-05 · 💻 cs.CL

Pith reviewed 2026-06-27 22:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords counterfactual reasoning · tabular question answering · fact verification · large language models · table reasoning · bidirectional verification · WikiTQ · TabFact

The pith

CRAFT improves tabular QA and fact verification by constructing counterfactual statement variants and weighting evidence from both original and alternative reasoning paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified framework called CRAFT that converts single-direction table reasoning into bidirectional verification. It builds declarative statements from the input, generates their counterfactual counterparts, pulls evidence along both paths, and combines the signals through a weighting step to decide the answer. This is shown to raise performance on WikiTQ and TabFact, with bigger gains on harder questions and smaller differences across LLM backbones. The claim matters because current LLM methods often fail when they cannot test alternatives, and the new process directly supplies those alternatives inside the same model call sequence. If the method holds, structured reasoning tasks move from one-pass inference to explicit hypothesis testing.

Core claim

CRAFT reformulates tabular question answering and fact verification as a general bidirectional verification process. Declarative statements and their counterfactual variants are constructed explicitly; evidence is extracted from reasoning along both the original and counterfactual paths; and the two streams are integrated by a weighted mechanism to produce the final answer. This process is shown to outperform single-direction baselines on WikiTQ and TabFact while narrowing gaps between different backbone LLMs.
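
A minimal sketch of how the two evidence streams might be combined. The abstract does not give the weighting formula, so the signed-score mix below, the `PathEvidence` structure, and the `w_original` parameter are illustrative assumptions, not CRAFT's actual mechanism.

```python
from dataclasses import dataclass


@dataclass
class PathEvidence:
    """Outcome of reasoning over one statement (hypothetical structure)."""
    holds: bool        # did the reasoning conclude the statement is true?
    confidence: float  # assumed to lie in [0, 1]


def signed(e: PathEvidence) -> float:
    """Map a verdict to a signed score: positive if the statement holds."""
    return e.confidence if e.holds else -e.confidence


def integrate(original: PathEvidence, counterfactual: PathEvidence,
              w_original: float = 0.5):
    """Combine both paths into a verdict on the ORIGINAL statement.

    Evidence that the counterfactual holds counts against the original,
    so its signed score is subtracted. The linear mix is an assumption.
    """
    score = (w_original * signed(original)
             - (1.0 - w_original) * signed(counterfactual))
    return score > 0, score


# Paths agree: original supported, counterfactual refuted.
print(integrate(PathEvidence(True, 0.6), PathEvidence(False, 0.8)))
# Paths conflict: the more confident counterfactual path wins.
print(integrate(PathEvidence(True, 0.6), PathEvidence(True, 0.9)))
```

The point of the sketch is only that a confident counterfactual path can overturn a weakly supported original verdict, which single-direction reasoning cannot do.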

What carries the argument

Bidirectional verification that generates counterfactual variants of declarative statements and integrates weighted evidence from both original and alternative reasoning paths.

If this is right

Accuracy rises on WikiTQ and TabFact, especially for complex multi-step questions.
Performance differences shrink across different LLM backbones.
Single-direction inference limits are reduced by explicit alternative-hypothesis testing.
The same framework applies to both question answering and fact verification without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The weighting step could be replaced by learned fusion to test whether hand-designed weights are necessary.
Counterfactual construction might transfer to non-table structured sources such as knowledge graphs or code repositories.
The method could be combined with self-consistency sampling to further increase robustness on long tables.

Load-bearing premise

Constructing and reasoning over counterfactual variants, followed by weighted integration, will produce more accurate answers than single-direction reasoning without adding new errors or biases.

What would settle it

Running the full CRAFT pipeline on WikiTQ or TabFact with a fixed LLM backbone and finding that accuracy is equal to or lower than the single-direction baseline.
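
A paired comparison along those lines could look like the sketch below. The function name and the toy predictions are hypothetical; a real check would use exact-match or the benchmark's official scorer on the full test split, with the same backbone for both systems.

```python
def paired_comparison(gold, baseline_pred, craft_pred):
    """Compare two systems on the same examples (exact-match scoring assumed)."""
    if not (len(gold) == len(baseline_pred) == len(craft_pred)):
        raise ValueError("all three lists must have the same length")
    n = len(gold)
    rows = list(zip(gold, baseline_pred, craft_pred))
    base_acc = sum(b == g for g, b, _ in rows) / n
    craft_acc = sum(c == g for g, _, c in rows) / n
    # Examples the baseline got wrong that the second system fixed, and vice versa.
    fixed = sum(b != g and c == g for g, b, c in rows)
    broken = sum(b == g and c != g for g, b, c in rows)
    return {"baseline_acc": base_acc, "craft_acc": craft_acc,
            "fixed": fixed, "broken": broken}


# Toy data, not results from the paper.
report = paired_comparison(
    gold=["a", "b", "c", "d"],
    baseline_pred=["a", "x", "c", "y"],
    craft_pred=["a", "b", "z", "d"],
)
print(report)
```

Reporting `fixed` and `broken` alongside accuracy shows whether the bidirectional step introduces new errors even when the net accuracy rises.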

Figures

Figures reproduced from arXiv: 2606.06842 by Changzai Pan, Chenshuo Pan, Jiayi Liang, Jie Zhang, Shuangyong Song, Yongxiang Li, Yujie Mao, Yu Zhao, Zhenhe Wu, Zhongjiang He.

**Figure 3.** WikiTQ and TabFact accuracy as the number …

**Figure 2.** Repeated-sampling analyses on WikiTQ/TabFact. (a) Accuracy comparison between voting methods and CRAFT. (b) Ideal (Pass@K) accuracy upper bounds compared with our method.

… their frequencies; and (2) Confidence-Weighted (CW), which also samples N answers but aggregates them by summing exp(score) for each unique candidate and choosing the answer with the highest total weight. Here we choose N = 3 to allow a…
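
The voting baselines and the Pass@K bound named in the Figure 2 text can be sketched as follows. The excerpt does not say what `score` is; the sketch assumes a log-probability-like value per sampled answer, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most frequent answer among the N samples."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, scores):
    """Sum exp(score) per unique candidate and pick the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += math.exp(score)
    return max(totals, key=totals.get)


def pass_at_k(answers, gold):
    """Ideal upper bound: correct if ANY sampled answer matches the gold one."""
    return gold in answers


# Toy example with N = 3 samples (values are illustrative only).
answers = ["12", "12", "15"]
scores = [-2.0, -2.0, -0.1]  # assumed log-probability-like scores
print(majority_vote(answers))                     # "12"
print(confidence_weighted_vote(answers, scores))  # "15"
print(pass_at_k(answers, "15"))                   # True
```

In the toy case the two low-confidence samples total 2·exp(−2) ≈ 0.27, less than exp(−0.1) ≈ 0.90 for the single confident sample, so the two voting rules disagree.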

**Figure 4.** Performance across different table sizes. We partition tables into three size groups with thresholds: for WikiTQ, small (<2000 tokens), medium (2000–4000), and large (>4000); for TabFact, small (<500 tokens), medium (500–800), and large (>800).

… effectively track, integrate, and reason over long input contexts (Liu et al., 2023a; Ye et al., 2023). To evaluate the impact of table size on performance, we com…
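
The size partition in the Figure 4 caption amounts to a simple threshold lookup. In the sketch below the thresholds come from the caption; the function name is hypothetical, the token count is assumed to be computed elsewhere with whatever tokenizer the experiments used, and treating both range endpoints as "medium" is an assumption that follows the caption's strict inequalities.

```python
# (lower, upper) token thresholds per dataset, as given in the Figure 4 caption.
THRESHOLDS = {
    "WikiTQ": (2000, 4000),
    "TabFact": (500, 800),
}


def size_group(dataset: str, n_tokens: int) -> str:
    """Assign a table to small / medium / large by its token count."""
    lower, upper = THRESHOLDS[dataset]
    if n_tokens < lower:
        return "small"
    if n_tokens <= upper:
        return "medium"
    return "large"


print(size_group("WikiTQ", 1500))   # small
print(size_group("WikiTQ", 4000))   # medium
print(size_group("TabFact", 801))   # large
```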

**Figure 5.** A Case Study Comparing Self-Critique and Counterfactual Reasoning Paths

**Figure 6.** A Case Study Showing How CRAFT Gets the Correct Answer When Both Reasoning Paths Start with a Wrong …

**Figure 7.** Rewriter Prompt Template

**Figure 8.** Reverser Prompt Template

**Figure 9.** Extractor Prompt Template

**Figure 10.** Rethinker Prompt Template

Original abstract

Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CRAFT introduces a bidirectional counterfactual approach unifying tabular QA and fact verification, but the abstract's lack of numbers and method details leaves the performance claims hard to assess.

Hi,

The main takeaway is that this paper puts forward CRAFT as a single framework that turns both tabular question answering and fact verification into a bidirectional process: it builds declarative statements plus their counterfactual variants, pulls evidence along both paths, and combines them with weighting to produce the answer. This is positioned as fixing the limits of single-direction reasoning in LLMs on structured tables.

The combination of the two tasks under one explicit counterfactual construction and integration step looks new relative to prior single-direction work. The paper does a clear job spelling out the motivation and sketching how the paths and weighting are meant to work at a high level.

The soft spots sit in the evidence. The abstract states consistent gains over baselines on WikiTQ and TabFact, with bigger lifts on complex questions and smaller gaps across different LLMs, yet supplies no actual scores, no description of how the counterfactuals are generated in practice, no weighting formula, and no ablations. Without those pieces it is difficult to judge whether the method delivers the claimed robustness or simply adds noise. The assumption that bidirectional paths will improve accuracy without introducing new errors is reasonable on paper but remains untested in the description given.

This is for researchers focused on LLM table reasoning. Someone looking for concrete ideas on counterfactual reformulations could pull useful structure from it, but anyone evaluating results will need the full experiments and code. It deserves a serious referee because the problem is real, the framing is coherent, and the tasks matter, even if the current write-up is light on supporting data.

I'd send it to peer review so the experimental side can be checked properly.

Referee Report

2 major / 0 minor

Summary. The paper proposes CRAFT, a unified counterfactual reasoning framework for tabular question answering and fact verification. It reformulates the tasks as a bidirectional verification process by explicitly constructing declarative statements and their counterfactual variants, extracting evidence along both original and counterfactual reasoning paths, and integrating the evidence via a weighted mechanism to produce the final answer. The central empirical claims are that this approach consistently outperforms representative baselines on WikiTQ and TabFact (with especially large gains on complex QA) and significantly reduces performance gaps across different LLM backbones.

Significance. If the empirical results hold and are supported by proper ablations and error analysis, the work could be moderately significant for table reasoning, as it offers a concrete mechanism to move beyond single-direction inference and potentially improve robustness and consistency across LLMs. The promised public code release would strengthen reproducibility.

major comments (2)

[Abstract] The abstract asserts performance improvements, gap reduction, and superiority on complex QA but supplies no quantitative results, implementation details, ablation studies, or error analysis, so it is impossible to determine whether the data actually support the stated claims.
[Abstract] The central claim depends on the premise that bidirectional counterfactual construction plus weighted integration produces more accurate answers without introducing new sources of error or bias; no evidence is provided to evaluate whether this premise holds or whether the weighting step is robust.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and comments. We address the two major comments on the abstract below, noting that the abstract provides a high-level summary while the supporting quantitative results, ablations, and analyses appear in the main manuscript.

Referee: [Abstract] The abstract asserts performance improvements, gap reduction, and superiority on complex QA but supplies no quantitative results, implementation details, ablation studies, or error analysis, so it is impossible to determine whether the data actually support the stated claims.

Authors: Abstracts are designed to be concise overviews and conventionally omit detailed numbers, implementation specifics, ablations, and error analyses. The manuscript supplies these in full: Section 4 reports the performance gains on WikiTQ and TabFact (including larger gains on complex questions), Section 5 contains the ablation studies on each component, and Section 6 presents the error analysis and LLM-gap reduction results. These sections directly support the claims summarized in the abstract.

Revision: no

Referee: [Abstract] The central claim depends on the premise that bidirectional counterfactual construction plus weighted integration produces more accurate answers without introducing new sources of error or bias; no evidence is provided to evaluate whether this premise holds or whether the weighting step is robust.

Authors: The manuscript evaluates this premise through controlled experiments. Main results show consistent accuracy gains from the bidirectional paths and weighted integration over single-direction baselines. Ablations isolate the contribution of counterfactual construction and the weighting mechanism, while error analysis and cross-LLM consistency metrics demonstrate that the approach reduces rather than introduces errors or bias. These evaluations appear in Sections 4–6.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces CRAFT as an empirical framework that constructs counterfactual variants, extracts bidirectional evidence, and applies weighted integration for tabular QA and fact verification. All central claims rest on reported experimental gains versus baselines on WikiTQ and TabFact rather than any mathematical derivation, self-referential definition, or fitted parameter renamed as a prediction. No equations, uniqueness theorems, or self-citation chains appear in the abstract or method description that would reduce the target result to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No concrete free parameters, axioms, or invented entities are described in the abstract; the ledger is therefore empty.

Reference graph

Works this paper leans on

64 extracted references · 34 canonical work pages

[1] Xi Fang and Weijie Xu and Fiona Anting Tan and Ziqing Hu and Jiani Zhang and Yanjun Qi and Srinivasan H. Sengamedu and Christos Faloutsos. Large Language Models (... 2024.
[2] Badaro, Gilbert and Saeed, Mohammed and Papotti, Paolo. Transformers for Tabular Data Representation: A Survey of Models and Applications. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00544
[3] Bavaresco, Anna and Bernardi, Raffaella and Bertolazzi, Leonardo and Elliott, Desmond and Fernández, Raquel and Gatt, Albert and Ghaleb, Esam and Giulianelli, Mario and Hanna, Michael and Koller, Alexander and Martins, Andre and Mondorf, Philipp and Neplenbroek, Vera and Pezzelle, Sandro and Plank, Barbara and Schlangen, David and Suglia, Alessandro ... LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. 2025. doi:10.18653/v1/2025.acl-short.20
[4] OpenAI. 2025.
[5] Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte... Language Models are Few-Shot Learners.
[6] TabFact: A Large-scale Dataset for Table-based Fact Verification. 2020.
[7] Chen, Zhiyu and Chen, Wenhu and Smiley, Charese and Shah, Sameena and Borova, Iana and Langdon, Dylan and Moussa, Reema and Beane, Matt and Huang, Ting-Hao and Routledge, Bryan and Wang, William Yang. FinQA: A Dataset of Numerical Reasoning over Financial Data. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.300
[8] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. 2023.
[9] Binding Language Models in Symbolic Languages. 2023.
[10] Evaluating Large Language Models Trained on Code. 2021.
[11] Feng, Fuli and Zhang, Jizhi and He, Xiangnan and Zhang, Hanwang and Chua, Tat-Seng. Empowering Language Understanding with Counterfactual Reasoning. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. doi:10.18653/v1/2021.findings-acl.196
[12] Herzig, Jonathan and Nowak, Pawel Krzysztof and M... TaPas: Weakly Supervised Table Parsing via Pre-training. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.398
[13] Jeong, Jiwon and Jang, Hyeju and Park, Hogun. Large Language Models Are Better Logical Fallacy Reasoners with Counterargument, Explanation, and Goal-Aware Prompt Formulation. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.384
[14] Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding. 2024.
[15] Kim, Jongho and Hwang, Seung-won. Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.97
[16] The Llama 3 Herd of Models. 2024.
[17] Li, Jiaxuan and Yu, Lang and Ettinger, Allyson. Counterfactual reasoning: Testing language models' understanding of hypothetical scenarios. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.70
[18] Li, Qianlong and Huang, Chen and Li, Shuai and Xiang, Yuanxin and Xiong, Deng and Lei, Wenqiang. GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
[19] Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004.
[20] Jixiong Liu and Yoan Chabot and Raphaël Troncy and Viet-Phi Huynh and Thomas Labbé and Pierre Monnin. From tabular data to knowledge graphs: A survey of semantic table interpretation tasks and methods. 2023. doi:10.1016/j.websem.2022.100761
[21] TAPEX: Table Pre-training via Learning a Neural SQL Executor. 2022.
[22] Liu, Tianyang and Wang, Fei and Chen, Muhao. Rethinking Tabular Data Understanding with Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.26
[23] Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638
[24] Weizheng Lu and Jing Zhang and Ju Fan and Zihao Fu and Yueguo Chen and Xiaoyong Du. Large language model for table processing: a survey. 2025. doi:10.1007/s11704-024-40763-6
[25] PoTable: Towards Systematic Thinking via Stage-oriented Plan-then-Execute Reasoning on Tables. 2025.
[26] Nahid, Md Mahadi Hasan and Rafiei, Davood. TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.320
[27] Nan, Linyong and Hsieh, Chiachun and Mao, Ziming and Lin, Xi Victoria and Verma, Neha and Zhang, Rui and Kryściński, Wojciech and Schoelkopf, Hailey and Kong, Riley and Tang, Xiangru and Mutuma, Mutethia and Rosand, Ben and Trindade, Isabel and Bandaru, Renusree and Cunningham, Jacob and Xiong, Caiming and Radev, Dragomir. FeTaQA: Free-form Table Question Answering. 2022. doi:10.1162/tacl_a_00446
[28] LEVER: Learning to Verify Language-to-Code Generation with Execution. 2023.
[29] ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios. 2026.
[30] Park, Jungsoo and Min, Sewon and Kang, Jaewoo and Zettlemoyer, Luke and Hajishirzi, Hannaneh. FaVIQ: FAct Verification from Information-seeking Questions. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.354
[31] Pasupat, Panupong and Liang, Percy. Compositional Semantic Parsing on Semi-Structured Tables. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015. doi:10.3115/v1/P15-1142
[32] Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135
[33] Qwen2.5 Technical Report. 2025.
[34] Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution. 2024.
[35] TableGPT2: A Large Multimodal Model with Tabular Data Integration. 2024.
[36] Sui, Yuan and Zou, Jiaru and Zhou, Mengyu and He, Xinyi and Du, Lun and Han, Shi and Zhang, Dongmei. TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.603
[37] Exchange of Perspective Prompting Enhances Reasoning in Large Language Models. 2025.
[38] Haldar, Rajarshi and Hockenmaier, Julia. Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1361
[39] Aly, Rami and Strong, Marek and Vlachos, Andreas. QA-NatVer: Question Answering for Natural Logic-based Fact Verification. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.521
[40] Raina, Vyas and Liusie, Adian and Gales, Mark. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.427
[41] Self-Consistency Improves Chain of Thought Reasoning in Language Models. Proceedings of the International Conference on Learning Representations (ICLR).
[42] Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. Proceedings of the International Conference on Learning Representations (ICLR).
[43] Wan, David and Vig, Jesse and Bansal, Mohit and Joty, Shafiq. On Positional Bias of Faithfulness for Long-form Summarization. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.442
[44] Evidence from counterfactual tasks supports emergent analogical reasoning in large language models. PNAS Nexus. 2025.
[45] Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny. Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022.
[46] Xie, Yuexiang and Sun, Fei and Deng, Yang and Li, Yaliang and Ding, Bolin. Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation. Findings of the Association for Computational Linguistics: EMNLP 2021. 2021. doi:10.18653/v1/2021.findings-emnlp.10
[47] Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models. Proceedings of INTERSPEECH 2023. 2023.
[48] Ye, Yunhu and Hui, Binyuan and Yang, Min and Li, Binhua and Huang, Fei and Li, Yongbin. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023. doi:10.1145/3539618.3591708
[49] Yin, Zhangyue and Sun, Qiushi and Chang, Cheng and Guo, Qipeng and Dai, Junqi and Huang, Xuanjing and Qiu, Xipeng. Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.936
[50] Yu, Peiying and Chen, Guoxin and Wang, Jingjing. Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.853
[51] Zhang, Yunjia and Henkel, Jordan and Floratou, Avrilia and Cahoon, Joyce and Deep, Shaleen and Patel, Jignesh M. Proc. VLDB Endow. 2024. doi:10.14778/3659437.3659452
[52] Zhang, Han and Ma, Yuheng and Yang, Hanfang. ALTER: Augmentation for Large-Table-Based Reasoning. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.9
[53] Zhang, Xinliang Frederick and Beauchamp, Nick and Wang, Lu. Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.963
[54] Zhang, Xiaokang and Luo, Sijia and Zhang, Bohan and Ma, Zeyao and Zhang, Jing and Li, Yang and Li, Guanlin and Yao, Zijun and Xu, Kangli and Zhou, Jinchang and Zhang-Li, Daniel and Yu, Jifan and Zhao, Shu and Li, Juanzi and Tang, Jie. TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios. Findings of the Association for C... 2025. doi:10.18653/v1/2025.findings-acl.538
[55] Zhang, Yanfang and Sun, Yiliu and Zhan, Yibing and Tao, Dapeng and Tao, Dacheng and Gong, Chen. Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
[56]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. Proceedings of the International Conference on Learning Representations (ICLR).
[57]

Context-faithful Prompting for Large Language Models

Zhou, Wenxuan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao. Context-faithful Prompting for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.968

work page doi:10.18653/v1/2023.findings-emnlp.968 2023
[58]

Critic-CoT: Boosting the Reasoning Abilities of Large Language Model via Chain-of-Thought Critic

Zheng, Xin and Lou, Jie and Cao, Boxi and Wen, Xueru and Ji, Yuqiu and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Zhang, Debing and Sun, Le. Critic-CoT: Boosting the Reasoning Abilities of Large Language Model via Chain-of-Thought Critic. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.89

work page doi:10.18653/v1/2025.findings-acl.89 2025
[59]

Table-R1: Region-based Reinforcement Learning for Table Understanding

Table-R1: Region-based Reinforcement Learning for Table Understanding. 2025.

2025
[60]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

work page doi:10.1038/s41586-025-09422-z
[61]

High-Quality Complex Text-to-SQL Data Generation through Chain-of-Verification

Zhang, Yuchen and Gao, Yuze and Chen, Bin and Li, Wenfeng and Sun, Shuo and Su, Jian. High-Quality Complex Text-to-SQL Data Generation through Chain-of-Verification. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. 2025

2025
[62]

RePanda: Pandas-powered Tabular Verification and Reasoning

Chegini, Atoosa and Rezaei, Keivan and Eghbalzadeh, Hamid and Feizi, Soheil. RePanda: Pandas-powered Tabular Verification and Reasoning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1549

work page doi:10.18653/v1/2025.acl-long.1549 2025
[63]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? 2025.

2025
[64]

TableZoomer: a collaborative agent framework for large-scale table question answering

TableZoomer: a collaborative agent framework for large-scale table question answering. Vicinagearth. 2025.

2025
