arxiv: 2502.11008 · v2 · submitted 2025-02-16 · 💻 cs.CL

CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models

Pith reviewed 2026-05-23 03:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords counterfactual reasoning · large language models · benchmark · causal reasoning · iterative reasoning · backtracking

The pith

Large language models perform near random guessing on formal counterfactual reasoning but improve with iterative guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can follow explicit formal rules to reason about what would happen in alternative scenarios. It presents CounterBench, a collection of one thousand questions built on diverse causal graphs and invented names to reduce reliance on memorized facts. Experiments reveal that most models score at chance levels. The authors introduce the CoIn method, which directs models to reason iteratively and backtrack when needed, raising performance on the benchmark for multiple models.

Core claim

Counterfactual reasoning using formal rules remains difficult for large language models, with performance often matching random guessing on the CounterBench dataset of one thousand questions featuring varied causal structures and nonsensical names. The CoIn paradigm, which prompts models to perform iterative reasoning with backtracking, leads to significant and consistent improvements across different large language models.
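
To make "formal rules" concrete, here is a minimal sketch assuming Pearl's standard three-step recipe (abduction, action, prediction) on a tiny deterministic Boolean model. The variable names (zint, blorp, kesh, vump), the graph, and the helper functions are invented for illustration; they are not taken from CounterBench, and this is not the paper's evaluation code.

```python
from itertools import product
from typing import Callable, Dict, List, Optional

Values = Dict[str, bool]

# Hypothetical toy world with nonsense names (an assumption, not from the dataset).
# Endogenous variables are listed in topological order; each reads earlier values.
EXOGENOUS: List[str] = ["u_zint", "u_blorp"]
EQUATIONS: Dict[str, Callable[[Values], bool]] = {
    "zint": lambda v: v["u_zint"],
    "blorp": lambda v: v["u_blorp"],
    "kesh": lambda v: v["zint"] and v["blorp"],  # kesh needs both parents
    "vump": lambda v: not v["kesh"],             # vump is blocked by kesh
}


def solve(exogenous: Values, do: Optional[Values] = None) -> Values:
    """Evaluate the model; variables named in `do` are forced (graph surgery)."""
    do = do or {}
    values = dict(exogenous)
    for name, equation in EQUATIONS.items():
        values[name] = do[name] if name in do else equation(values)
    return {name: values[name] for name in EQUATIONS}


def abduct(evidence: Values) -> List[Values]:
    """Step 1 (abduction): keep every background setting that fits the evidence."""
    consistent = []
    for bits in product([False, True], repeat=len(EXOGENOUS)):
        background = dict(zip(EXOGENOUS, bits))
        world = solve(background)
        if all(world[k] == v for k, v in evidence.items()):
            consistent.append(background)
    return consistent


def counterfactual(evidence: Values, do: Values, query: str) -> Optional[bool]:
    """Steps 2 and 3 (action, prediction) in each consistent background.

    Returns None when the evidence does not pin down a single answer.
    """
    answers = {solve(background, do)[query] for background in abduct(evidence)}
    return answers.pop() if len(answers) == 1 else None


# "We observed kesh. Would vump have occurred had blorp not occurred?"
print(counterfactual({"kesh": True}, {"blorp": False}, "vump"))  # True
```

Returning `None` for under-determined queries is a design choice of this sketch; a benchmark with yes/no answers would presumably only include questions where the evidence fixes the answer.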

What carries the argument

The CoIn method, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions.
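
The control flow of "iterate, evaluate, backtrack" can be illustrated generically. The sketch below is a plain depth-first search over reasoning steps; it is not the CoIn prompt or algorithm from the paper, whose details are not reproduced on this page. The callables `propose`, `is_valid`, and `is_complete` are hypothetical stand-ins for whatever generates and checks steps (in practice, model calls).

```python
from typing import Callable, List, Optional, Sequence

Chain = List[str]


def solve_with_backtracking(
    propose: Callable[[Chain], Sequence[str]],
    is_valid: Callable[[Chain], bool],
    is_complete: Callable[[Chain], bool],
    max_depth: int = 8,
) -> Optional[Chain]:
    """Extend a chain of steps one at a time, undoing steps that lead nowhere."""

    def extend(chain: Chain) -> Optional[Chain]:
        if is_complete(chain):
            return chain
        if len(chain) >= max_depth:
            return None  # give up on this branch
        for step in propose(chain):
            candidate = chain + [step]
            if not is_valid(candidate):
                continue  # evaluation rejected the step; try the next proposal
            result = extend(candidate)
            if result is not None:
                return result
            # Dead end further down: backtrack and try the next proposal.
        return None

    return extend([])


# Toy stand-ins (assumptions): the first proposal "a1" looks fine on its own
# but cannot be followed by "b1", so the search must back up and pick "a2".
def toy_propose(chain: Chain) -> Sequence[str]:
    return ["a1", "a2"] if not chain else ["b1"]


def toy_is_valid(chain: Chain) -> bool:
    return chain != ["a1", "b1"]


def toy_is_complete(chain: Chain) -> bool:
    return len(chain) == 2


print(solve_with_backtracking(toy_propose, toy_is_valid, toy_is_complete))
# ['a2', 'b1']
```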

If this is right

  • Counterfactual performance can be boosted in existing models without additional training.
  • Results hold across multiple different large language models.
  • The benchmark design with nonsensical names and varied graphs helps isolate formal inference.
  • Models struggle more on complex graph structures and certain question types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • CoIn might extend to other structured reasoning tasks beyond counterfactuals.
  • Combining CoIn with retrieval or external tools could further enhance causal reasoning.
  • Testing on real-world scenarios could reveal whether the gains transfer outside the benchmark.

Load-bearing premise

The questions in CounterBench test only formal counterfactual inference and do not allow models to succeed through patterns learned during pretraining.

What would settle it

Running the models on a fresh set of counterfactual questions with novel causal graphs and names, confirming that accuracy stays near random without CoIn and rises with it.
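
One way to judge "near random" on such a rerun is a confidence interval around the observed accuracy. The sketch below uses the Wilson score interval and assumes yes/no questions, so chance is 0.5; that assumption, the function names, and the example counts are illustrative and are not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def differs_from_chance(correct: int, total: int, chance: float = 0.5) -> bool:
    """True when the chance level lies outside the interval."""
    low, high = wilson_interval(correct, total)
    return not (low <= chance <= high)


# Hypothetical counts on 1,000 binary questions (not figures from the paper).
print(differs_from_chance(520, 1000))  # False: 52% is compatible with guessing
print(differs_from_chance(700, 1000))  # True: 70% is not
```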

Figures

Figures reproduced from arXiv: 2502.11008 by Jing Ma, Ruixiang Tang, Vivek K. Singh, Yuefei Chen.

Figure 1: Comparison of accuracy scores on the Coun…
Figure 2: Illustration of our framework. We create CounterBench, a dataset featuring four types of counterfactual…
Figure 3: Error Analysis of CausalCoT.
Figure 4: Accuracy comparison between Standard, CoIn, and CausalCoT in Anticommonsense and Commonsense Dataset.
Figure 5: The prompt design of CoIn
Figure 7: Anti-commonsense Example. Question: Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships: The man in the room has a direct effect on room. The candle has a direct effect on room. We know that blowing out the candle and candle with wax causes dark room. We observed the candle has wax. Would the room is dark if not blowi…
Figure 6: Error Analysis comparison between Our Method and CausalCoT.
Figure 10: Error Analysis for Babbage-002 in Causal…
Figure 11: CausalCoT Instruction Example
Figure 12: CoIn Instruction Example
Figure 13: Conclusion Error Example. Response: The correct answer is (0, 1, 0, 0, 0, …
Figure 14: Type Mismatch
Original abstract

Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLM's counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs.Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CounterBench, a benchmark dataset of 1K counterfactual reasoning questions using formal rules, varying difficulty levels, diverse causal graph structures, distinct question types, and multiple nonsensical name variants to evaluate LLMs on formal counterfactual inference (distinct from commonsense causal reasoning). Experiments indicate that most LLMs perform at levels comparable to random guessing on these tasks. The authors propose CoIn, a reasoning paradigm that guides LLMs through iterative reasoning and backtracking, and report that it significantly improves performance across different LLMs. The dataset is released publicly.

Significance. If the benchmark successfully isolates formal counterfactual inference without pretraining leakage or surface heuristics, the near-random baseline performance and consistent gains from CoIn would highlight a core limitation in current LLMs' causal reasoning and provide a practical improvement method with relevance to AI safety and robust reasoning systems. The public dataset release supports reproducibility and further work.

major comments (2)
  1. [Abstract] Abstract: The central claims that LLMs perform at levels comparable to random guessing and that CoIn significantly improves performance lack any quantitative results, error bars, statistical tests, or details on validation of difficulty levels and graph structures. This is load-bearing for the empirical findings.
  2. [Abstract] Abstract / Dataset Design: The design uses nonsensical name variants and varied causal graph structures to isolate formal counterfactual rules from pretraining leakage or surface patterns, but no control experiments (e.g., accuracy delta on real vs. nonsensical names, performance on held-out graph structures, or comparison to post-cutoff models) are reported to verify this isolation. This directly affects whether the near-random results can be interpreted as evidence of failure on formal inference.
minor comments (2)
  1. [Abstract] Abstract contains a missing space: 'LLMs.Our dataset' should be 'LLMs. Our dataset'.
  2. [Abstract] Abstract: 'enhance LLM's counterfactual reasoning ability' should read 'enhance LLMs' counterfactual reasoning abilities' for grammatical consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and validation of the benchmark design.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims that LLMs perform at levels comparable to random guessing and that CoIn significantly improves performance lack any quantitative results, error bars, statistical tests, or details on validation of difficulty levels and graph structures. This is load-bearing for the empirical findings.

    Authors: We agree that the abstract should include quantitative support for the claims. In the revised manuscript we will update the abstract to report key metrics such as average LLM accuracy (near random baseline levels), the magnitude of CoIn gains, reference to error bars from repeated runs, and brief notes on how difficulty levels and graph structures were constructed and validated. revision: yes

  2. Referee: [Abstract] Abstract / Dataset Design: The design uses nonsensical name variants and varied causal graph structures to isolate formal counterfactual rules from pretraining leakage or surface patterns, but no control experiments (e.g., accuracy delta on real vs. nonsensical names, performance on held-out graph structures, or comparison to post-cutoff models) are reported to verify this isolation. This directly affects whether the near-random results can be interpreted as evidence of failure on formal inference.

    Authors: We acknowledge the value of explicit controls. While nonsensical names and diverse structures were chosen to minimize leakage, we did not report direct comparisons. We will add control analyses in the revision, including accuracy differences between real and nonsensical name variants and results on held-out graph structures, to better substantiate the isolation claim. revision: yes
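
The real-versus-nonsensical-name control discussed above comes down to comparing two accuracies. A minimal sketch, assuming the two variants are scored on independent question sets and using a pooled two-proportion z-test, is given below; the function name and counts are hypothetical. If the same questions are scored under both namings, a paired test such as McNemar's would be the more appropriate choice.

```python
import math
from typing import Tuple


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both groups all-correct or all-wrong: no evidence of a gap
    z = (p_a - p_b) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: real names vs. nonsensical names, 1,000 questions each.
z, p = two_proportion_z(700, 1000, 500, 1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A large, significant gap in favour of real names would suggest models lean on prior knowledge; no gap would support the claim that the benchmark isolates formal inference.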

Circularity Check

0 steps flagged

Empirical benchmark and method evaluation with no self-referential derivations or fitted predictions

full rationale

The paper introduces CounterBench as an external dataset of 1K questions and evaluates LLMs plus the CoIn method on it. No equations, parameters, or first-principles derivations are present that reduce any claimed result to its own inputs by construction. The design choices (nonsensical names, graph variants) are presented as methodological safeguards rather than fitted quantities renamed as predictions. Self-citations, if any, are not load-bearing for the central empirical claims. The work is therefore self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical benchmark and prompting study.

pith-pipeline@v0.9.0 · 5742 in / 1040 out tokens · 15796 ms · 2026-05-23T03:07:31.937032+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation

    cs.LG 2026-01 conditional novelty 6.0

    Fine-tuned LLMs produce plausible counterfactuals for health interventions and recover 20% F1 via data augmentation in label-scarce sensor datasets.

  2. DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

    cs.CL 2026-04 unverdicted novelty 5.0

    DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  1. [1]

    Alwin. 2023. Understanding causal ai: Bridging the gap between correlation and causation. https://www.alwin.io/causal-ai. Accessed: 2025-01-06

  2. [2]

    Anthropic. 2024. Claude. https://www.anthropic.com/api. Accessed: 2025-01-06

  3. [3]

    Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Dushyant Singh Sengar, Mayank Jindal, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, and Aman Chadha. 2024. Cause and effect: Can large language models truly understand causality? In Proceedings of the AAAI Symposium Series, volume 4, pages 2--9

  4. [4]

    Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, and Manuel Gomez-Rodriguez. 2024. Counterfactual token generation in large language models. arXiv preprint arXiv:2409.17027

  5. [5]

    Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. Causalml: Python package for causal machine learning. arXiv preprint arXiv:2002.11631

  6. [6]

    DeepSeek. 2024. DeepSeek: AI-Powered Search Engine. https://www.deepseek.com/. Accessed: 2025-02-15

  7. [7]

    Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138--1158

  8. [8]

    Google. 2024. Gemini. https://gemini.google.com/. Accessed: 2025-01-06

  9. [9]

    Ian D Gow, David F Larcker, and Peter C Reiss. 2016. Causal inference in accounting research. Journal of Accounting Research, 54(2):477--523

  10. [10]

    Emilia Gvozdenović, Lucio Malvisi, Elisa Cinconze, Stijn Vansteelandt, Phoebe Nakanwagi, Emmanuel Aris, and Dominique Rosillon. 2021. Causal inference concepts applied to three observational studies in the context of vaccine development: from theory to practice. BMC Medical Research Methodology, 21:1--10

  11. [11]

    Kairong Han, Kun Kuang, Ziyu Zhao, Junjian Ye, and Fei Wu. 2024. Causal agent based on large language model. arXiv preprint arXiv:2408.06849

  12. [12]

    Paul W Holland. 1986. Statistics and causal inference. Journal of the American statistical Association, 81(396):945--960

  13. [13]

    Zhenyang Hua, Shuyue Xing, Huixing Jiang, Chen Wei, and Xiaojie Wang. 2024. Improving causal inference of large language models with scm tools. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 3--14. Springer

  14. [14]

    Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, LYU Zhiheng, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. 2023. Cladder: Assessing causal reasoning in language models. In Thirty-seventh conference on neural information processing systems

  15. [15]

    Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine learning, pages 3020--3029. PMLR

  16. [16]

    Atoosa Kasirzadeh and Andrew Smart. 2021. The use and misuse of counterfactuals in ethical machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 228--236

  17. [17]

    Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050

  18. [18]

    Lisa Koonce, Karen K Nelson, and Catherine M Shakespeare. 2011. Judging the relevance of fair value for financial instruments. The Accounting Review, 86(6):2075--2098

  19. [19]

    Parthasarathy Krishnamurthy and Anuradha Sivaraman. 2002. Counterfactual thinking and advertising responses. Journal of Consumer Research, 28(4):650--658

  20. [20]

    Evangelia Kyrimi, Somayyeh Mossadegh, Jared M Wohlgemut, Rebecca S Stoner, Nigel RM Tai, and William Marsh. 2025. Counterfactual reasoning using causal bayesian networks as a healthcare governance tool. International Journal of Medical Informatics, 193:105681

  21. [21]

    Jia Li and Xiang Li. 2024. Relation-first modeling paradigm for causal representation learning toward the development of agi. Preprint, arXiv:2307.16387. https://arxiv.org/abs/2307.16387

  22. [22]

    Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, et al. 2024. Mapping the increasing use of llms in scientific papers. arXiv preprint arXiv:2404.01268

  23. [23]

    Jinxin Liu, Shulin Cao, Jiaxin Shi, Tingjian Zhang, Lunyiu Nie, Linmei Hu, Lei Hou, and Juanzi Li. 2024a. How proficient are large language models in formal languages? an in-depth insight for knowledge base question answering. In Findings of the Association for Computational Linguistics ACL 2024, pages 792--815

  24. [24]

    Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, et al. 2024b. Large language models and causal inference in collaboration: A comprehensive survey. arXiv preprint arXiv:2403.09606

  25. [25]

    Massimo Loi and Margarida Rodrigues. 2012. A note on the impact evaluation of public policies: the counterfactual analysis

  26. [26]

    Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30

  27. [27]

    Jing Ma. 2024. Causal inference with large language model: A survey. arXiv preprint arXiv:2409.09822

  28. [28]

    SL Morgan. 2015. Counterfactuals and causal inference. Cambridge University Press

  29. [29]

    Elena Musi and Rudi Palmieri. 2024. The fallacy of explainable generative ai: evidence from argumentative prompting in two domains. In CEUR Workshop Proceedings, volume 3769, pages 59--69

  30. [30]

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. 2024. Skeleton-of-thought: Prompting llms for efficient parallel generation. In The Twelfth International Conference on Learning Representations

  31. [31]

    OpenAI. 2024. Models. https://platform.openai.com/docs/models. Accessed: 2025-01-06

  32. [32]

    Judea Pearl. 2009. Causality. Cambridge university press

  33. [33]

    Judea Pearl and Dana Mackenzie. 2018. The book of why: the new science of cause and effect. Basic books

  34. [34]

    Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066

  35. [35]

    Neil Sahota. 2023. Causal ai: Bridging the gap between correlation and causation. https://www.neilsahota.com/causal-ai-bridging-the-gap-between-correlation-and-causation/. Accessed: 2025-01-06

  36. [36]

    Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2023. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379

  37. [37]

    Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pages 3076--3085. PMLR

  38. [38]

    Amit Sharma and Emre Kiciman. 2020. Dowhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216

  39. [39]

    Artur Tarassow. 2023. The potential of llms for coding with low-resource and domain-specific programming languages. arXiv preprint arXiv:2307.13018

  40. [40]

    Vicuna. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90% https://vicuna.lmsys.org/. Accessed: 2023

  41. [41]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  42. [42]

    Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. 2018. Ganite: Estimation of individualized treatment effects using generative adversarial nets. In International conference on learning representations

  43. [43]

    Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. arXiv preprint arXiv:2308.13067

  44. [44]

    Cheng Zhang, Stefan Bauer, Paul Bennett, Jiangfeng Gao, Wenbo Gong, Agrin Hilmkil, Joel Jennings, Chao Ma, Tom Minka, Nick Pawlowski, et al. 2023. Understanding causality with large language models: Feasibility and opportunities. arXiv preprint arXiv:2304.05524
